ActiveTigger@PyData2025

Collaborative Text Annotation Tool for Computational Social Sciences

Julien Boelaert - Paul Girard - Étienne Ollion - Émilien Schultz
CREST/GENES

2025-09-30

Text as data in social sciences

  • Explosion of text content (social media, news, …)
  • Adoption of NLP methods by CSS & DH

Example from IC2S2 2025

BERT, GPT and proliferation of methods

Rapid adoption of model-based text processing

  • Encoders with BERT (2018)
  • Decoders with GPT (ChatGPT, 2022)
  • Proliferation of closed & open models

Consequences:

  • a diversity of new methods that are not yet stabilized
  • a need for specific resources (coding skills, GPU)

For what kinds of uses?

  • What are the stances of public opinion on global warming?
  • What is the prevalence of gender-based analyses in French social sciences?
  • What are the circumstances of transmission of an emerging infectious disease, based on open-ended survey answers?

Annotate more than 50,000 abstracts for concepts

Existing annotation tools

Lots of tools (Doccano, Label Studio, Argilla, INCEpTION, …)

  • Primarily designed for computer science/ML
    • Diversity of tasks (classification, span, …)
  • Far from the experience of non-expert users (social scientists, journalists, …)
  • No existing community for social sciences

The origins of ActiveTigger 🐯

Training classifiers for social sciences

  • A specific and frequent task: text classification
  • Transformer encoder models (BERT) are powerful
    • Easy to fine-tune
    • State of the art (SOTA)
  • Active learning accelerates annotation
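
One common way to make active learning concrete is uncertainty sampling: the current classifier scores the unlabeled texts, and the annotator is shown the ones it is least sure about. The sketch below is an illustration under stated assumptions, not ActiveTigger's actual implementation: the function names, the input format (a dict of class probabilities per text id) and the example values are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, k):
    """Return the ids of the k unlabeled texts with the highest entropy.

    `predictions` maps a text id to the class probabilities predicted by
    the current model (hypothetical input format for this sketch).
    """
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:k]


# Hypothetical predictions for four unlabeled abstracts (two classes).
predictions = {
    "a1": [0.98, 0.02],  # model is confident
    "a2": [0.55, 0.45],  # model hesitates
    "a3": [0.80, 0.20],
    "a4": [0.50, 0.50],  # maximal uncertainty
}

to_annotate = select_most_uncertain(predictions, k=2)
print(to_annotate)  # ['a4', 'a2']
```

After each batch of new annotations the model would be retrained and the ranking recomputed, so the annotation effort goes where the model hesitates most.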

A first prototype in R Shiny + Python in 2023

Since 2024: ➕ Collaboration ➕ Stability ➕ Features

Main goals: open source research software

  1. accelerate classifier training to scale annotations to a large corpus
  2. make it possible to evaluate classifier performance
  3. limit dependence on external services by using small models
  4. provide a pedagogical solution for non-expert users and for training
  5. stimulate community discussion on needs & best practices
  6. promote open source & open research tools
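
Goal 2 can be illustrated by comparing model predictions with human annotations on a held-out set. The following is a minimal sketch, assuming a binary task; the function name, the label values and the toy data are hypothetical, and the metrics reported by ActiveTigger itself may differ.

```python
def binary_metrics(gold, predicted, positive):
    """Precision, recall and F1 for one class, from paired label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical held-out annotations and model predictions.
gold = ["pydata", "pydata", "other", "other", "pydata", "other"]
predicted = ["pydata", "other", "other", "pydata", "pydata", "other"]

scores = binary_metrics(gold, predicted, positive="pydata")
print(scores)
```

Keeping the held-out set separate from the texts selected during annotation matters here, because texts picked by active learning are not a random sample of the corpus.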

Architecture

  • API (Python) / Frontend (React) / API Client (Python)
  • Leverage Python ML/DL packages (sklearn, transformers, …)
  • UX designed for non-expert users
    • e.g. annotation on a smartphone
  • Both on-premises and software as a service, to fit different needs
    • Docker compose

Demo time

  • A lot of scientific publications mention Python
  • Only some of them are PyData-related
  • How to annotate them as PyData / not PyData?

Demo with ActiveTigger

Dataset: OpenAlex, filtered to remove empty abstracts and duplicates

Connect with: pydata2025/pydata2025
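
The filtering step could look like the sketch below. It is an assumption about how such a cleanup might be done, not the script used for the demo: the record layout (`id` and `abstract` keys) and the example records are hypothetical, and duplicates are detected here by comparing abstracts after lowercasing and collapsing whitespace.

```python
def clean_corpus(records):
    """Drop records with an empty abstract and keep the first copy of duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        abstract = (record.get("abstract") or "").strip()
        if not abstract:
            continue  # empty or missing abstract
        key = " ".join(abstract.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate of an abstract already kept
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical records standing in for an export of publications.
records = [
    {"id": "w1", "abstract": "We use pandas to analyse survey data."},
    {"id": "w2", "abstract": ""},
    {"id": "w3", "abstract": "We use  pandas to analyse survey data. "},
    {"id": "w4", "abstract": None},
    {"id": "w5", "abstract": "A Python package for sequence alignment."},
]

cleaned = clean_corpus(records)
print([r["id"] for r in cleaned])  # ['w1', 'w5']
```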

https://www.css.cnrs.fr/active-tigger/

Current situation & next steps

  • A community of early users
  • Stable version by the end of 2025
    • Streamline UX
    • Complete dockerization
    • Finish documentation
  • In the future
    • Build and run a community
    • Prioritize our roadmap (BERTopic)

Funding

https://www.css.cnrs.fr/active-tigger/

Funding: GENES / DRARI / Progedo

Main contributors: Julien Boelaert (UL); Étienne Ollion (CREST); Paul Girard (OuestWare); Emma Bonutti (CREST); Annina Claesson (CREST); Léo Mignot (CED); Jule Brion (PACTE); Arnault Chatelain (CREST); Axel Morin (CREST)