Experiments

We wanted to understand which models and tasks would benefit most from our curation process. As baselines for our experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment) using crowdsourced labels. Each crowdsourced data set has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.

We compared each of these four baseline conditions against the corresponding curated condition, in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning. All models plateaued before reaching parity with the experts’ internal alignment, so we stopped at 6 iterations (~400 fine-tuning and ~250 evaluation samples) for the lower complexity task and 5 iterations (~250 fine-tuning and ~150 evaluation samples) for the higher complexity task. (Note that the lower complexity task had a larger variety of examples, which may account for the longer time needed to converge.) Both data sets had a final class balance of ~40% positive examples.
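The iterate-until-plateau procedure above can be sketched as a simple loop. This is a hypothetical illustration, not the actual implementation: `select_batch`, `finetune`, and `kappa_vs_experts` are placeholder names standing in for the curation, training, and evaluation steps, and the plateau threshold is an assumed parameter.

```python
def run_curation_loop(select_batch, finetune, kappa_vs_experts,
                      model, max_rounds, plateau_eps=0.01):
    """Iterate curation -> fine-tune -> evaluate, stopping early once
    the model's agreement with experts stops improving (a plateau).

    The three callables are placeholders for the real curation,
    training, and evaluation steps described in the post.
    """
    history = []  # expert-alignment score after each round
    for _ in range(max_rounds):
        train_set, eval_set = select_batch(model)
        model = finetune(model, train_set)
        history.append(kappa_vs_experts(model, eval_set))
        # Stop once the round-over-round improvement falls below the
        # plateau threshold.
        if len(history) >= 2 and history[-1] - history[-2] < plateau_eps:
            break
    return model, history
```

With toy stand-ins whose agreement score improves with diminishing returns, the loop halts after a handful of rounds, mirroring the 5–6 iterations reported above.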

The table below provides an overview of the scale and quality of the data used in each condition. Experts reached an average pairwise Cohen’s Kappa of .81 (on the lower complexity task) and .78 (on the higher complexity task) through the curation process. We consider these the ceiling for model performance. To assess the quality of our crowdsourced data, we calculated Kappa alignment between crowdsourced annotations and experts based on our full curated set, which was .59 (lower complexity) and .41 (higher complexity).
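The pairwise Cohen's Kappa figures above can be computed from two annotators' label sequences: observed agreement is corrected by the agreement expected under chance, given each annotator's label marginals. A minimal sketch (the labels and counts below are illustrative, not the paper's data):

```python
from collections import Counter


def cohens_kappa(a, b):
    """Pairwise Cohen's kappa between two annotators' label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: expected overlap given each annotator's marginals.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Illustrative class-imbalanced binary task (mostly "benign" labels).
expert = ["benign"] * 18 + ["positive"] * 2
rater = ["benign"] * 17 + ["positive"] * 3
```

Note that under heavy class imbalance, raw percent agreement can look high while kappa stays modest, which is why kappa is the more informative quality measure here.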
