The Synthetic Data Playbook


How to Cook Better Training Data for LLMs


SE 26

The Data Black Box

Frontier labs (OpenAI, Google, Anthropic) don't disclose how they build their training data.

Neither do the Chinese labs (DeepSeek, Qwen).

Training data is the most important ingredient in building an LLM, yet the recipes are kept secret.

Digital Sovereignty

If you can't build the data, you can't build the model.
If you can't build the model, you depend on those who can.

This work puts the knowledge out in the open for everyone: governments, universities, startups, and individuals.

LLMs: What's Under the Hood

You use these every day: ChatGPT, Copilot, Claude.

Under the hood: a giant function that takes tokens in and predicts tokens out.

Trained by reading billions of web pages, learning to predict the next word.

Data quality defines model quality.

Input text → LLM (billions of parameters) → Output text
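The "giant function" above can be illustrated with a toy stand-in: a bigram model that, like an LLM, takes tokens in and predicts the next token out, just with frequency counts instead of billions of parameters. Everything here is purely illustrative.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus: str):
    """Count which token follows which in a tiny whitespace-tokenized corpus."""
    tokens = corpus.split()
    follows = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        follows[a][b] += 1
    return follows

def predict_next(follows, token: str) -> str:
    """Predict the most frequent successor of `token`."""
    return follows[token].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat the cat ran")
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once -> cat
```

An LLM replaces the counting table with a learned function over long contexts, which is exactly why its training data matters so much.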

You Start With the Entire Internet...

...and throw away 98.6% of it

DCLM: 240T tokens from Common Crawl → 1.4% survives as DCLM-Baseline
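As a quick sanity check on the numbers above (assuming the 1.4% survival rate applies to the full 240T-token pool):

```python
total_tokens = 240e12    # Common Crawl pool fed into the DCLM pipeline
survival_rate = 0.014    # 1.4% kept as DCLM-Baseline

kept = total_tokens * survival_rate          # ~3.36T tokens survive
discarded_pct = (1 - survival_rate) * 100    # ~98.6% thrown away
print(f"kept: {kept / 1e12:.2f}T tokens, discarded: {discarded_pct:.1f}%")
```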

Rewrite Instead of Filter

Raw Web Text

★★★ BeSt DeAls!!!
Photosynthesis is the process by wich plants convert sunlit into energy. It occurs in the chloroplasts
Click here for more → → →
© 2019 AllScienceInfo.biz
Carbon dioxide and water are transformed into glucose and oxygen... [AD] [AD] [POPUP]

LLM-Rewritten FAQ

Q: What is photosynthesis?
A: Photosynthesis is the process by which plants convert sunlight into chemical energy. It occurs in organelles called chloroplasts.

Q: What are the inputs and outputs?
A: Plants take in carbon dioxide (CO₂) and water (H₂O), and using light energy, produce glucose (C₆H₁₂O₆) and oxygen (O₂).

Same knowledge, better packaging.
You keep 100% of your data instead of discarding nearly 99% of it.
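The "rewrite instead of filter" step can be sketched as a prompt-construction function. The template wording and names here are illustrative, not the actual FinePhrase prompts; a real pipeline would send the resulting prompt to an LLM server.

```python
# Hypothetical FAQ-rewriting template; shows the shape of the idea only.
FAQ_TEMPLATE = (
    "Rewrite the following web text as a question-and-answer FAQ.\n"
    "Keep all factual content; drop ads, links, and boilerplate.\n\n"
    "Text:\n{source}\n\nFAQ:"
)

def build_rewrite_prompt(source: str) -> str:
    """Wrap raw web text in a structured rewriting instruction."""
    return FAQ_TEMPLATE.format(source=source.strip())

prompt = build_rewrite_prompt("★★★ BeSt DeAls!!! Photosynthesis is the process...")
print(prompt.splitlines()[0])  # the instruction line
```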

What's the Best Recipe?

Three knobs to tune: source data, prompt strategy, and generator model.

  • 70+ experiments
  • 1T+ tokens generated
  • 60k+ GPU hours

Our Integration Test Suite

For each experiment, we:

  • Train a 1.2B parameter model from scratch
  • Feed it 20B tokens of synthetic and original data
  • Test on 12 benchmarks (reading, math, reasoning, knowledge...)
  • Compare against curated web datasets as baselines

This is expensive, so we tried cheaper proxies:

  • DCLM/Edu scores (used for filtering pretraining data)
  • Smaller training runs

None correlated well enough.

No shortcuts: you must train and evaluate to know if your data is good.

FinePhrase Wins

Our best synthetic recipe outperforms all tested baselines, including curated web data.

Let's unpack how.

Prompt Design Is the #1 Lever

Structured prompts beat everything:

  • Math reformatting
  • Table extraction
  • FAQ generation
  • Tutorial rewriting

These beat curated web data and all prior synthetic baselines.

The prompt matters more than the model or the source data.
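The four structured strategies can be sketched as a family of prompt templates. The wording below is a guess at the general shape, not the released prompts:

```python
# Hypothetical templates for the four structured strategies; the real
# prompts ship with the project, these only illustrate the pattern.
STRATEGIES = {
    "math":     "Reformat any mathematical content in the text below as "
                "clean problem/solution pairs.\n\n{source}",
    "table":    "Extract the facts in the text below into a table.\n\n{source}",
    "faq":      "Rewrite the text below as a question-and-answer FAQ.\n\n{source}",
    "tutorial": "Rewrite the text below as a step-by-step tutorial.\n\n{source}",
}

def make_prompt(strategy: str, source: str) -> str:
    """Instantiate one structured strategy for a given source document."""
    return STRATEGIES[strategy].format(source=source)

print(sorted(STRATEGIES))  # ['faq', 'math', 'table', 'tutorial']
```

Sweeping a dictionary like this is also how you would ablate the prompt axis independently of the generator model and source data.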

Smol Models Are Enough

1B-scale rephrasers match 4B, 12B, and 27B model performance.

SmolLM2-1.7B beats Qwen, Gemma, Llama, Falcon, and Granite.

And it's much faster:

  • 3.0x faster than Gemma-3-12B
    (9,220 vs 3,046 tps/gpu)
  • 5.3x faster than Gemma-3-27B
    (9,220 vs 1,724 tps/gpu)

Better quality and faster inference.
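The speedup factors follow directly from the measured throughputs:

```python
smollm_tps = 9220   # SmolLM2-1.7B, tokens/sec/GPU on H100
gemma12_tps = 3046  # Gemma-3-12B
gemma27_tps = 1724  # Gemma-3-27B

print(f"{smollm_tps / gemma12_tps:.1f}x faster than Gemma-3-12B")  # 3.0x
print(f"{smollm_tps / gemma27_tps:.1f}x faster than Gemma-3-27B")  # 5.3x
```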

Diversity Beats Consistency

Messy beats polished.

SmolLM2's varied, inconsistent outputs outperform Qwen3's template-locked, clean outputs.

Synthetic-only fails.
You must mix synthetic data with original web data.

The mix-in dataset matters as much as the synthetic data itself.

Template Collapse

Qwen3 Math outputs:

115 / 1000 samples start with the exact same sentence

SmolLM2 Math outputs:

Highly varied formatting and structure

Diversity beats consistency for pretraining.
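One crude way to quantify template collapse, sketched here as an assumption rather than the project's actual metric, is the fraction of samples sharing the single most common opening sentence:

```python
from collections import Counter

def template_collapse_rate(samples):
    """Fraction of samples that share the most common opening sentence.

    A high value suggests the generator is locked into one template,
    as with the 115/1000 identical openings in the Qwen3 math outputs.
    """
    openings = [s.split(".")[0].strip() for s in samples]
    _, count = Counter(openings).most_common(1)[0]
    return count / len(samples)

collapsed = (["We need to find x. Step 1..."] * 115
             + [f"Unique opening {i}. ..." for i in range(885)])
print(template_collapse_rate(collapsed))  # 0.115
```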

What We Found

  • Prompt design is the #1 lever. Structured formats (Math, Table, FAQ, Tutorial) outperform everything.
  • 1B models suffice. SmolLM2-1.7B is the best rephraser across the board.
  • Mix original data in. Synthetic-only fails. The mix-in dataset matters.
  • Diversity wins over polish. Varied, messy outputs beat clean, template-locked ones.

How Do You Rephrase 1T Tokens?

Each experiment generates ~15B tokens.

70+ experiments = 1T+ tokens of LLM output.

At ~4,750 tokens/sec/GPU (mean across all experiments):

  • ~880 GPU-hours per experiment
  • ~$3k cloud cost per experiment
  • ~$215k total compute budget
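These figures follow from the generation volume and throughput. The hourly GPU price below is an assumption (a plausible H100 cloud rate, not a number from the talk), chosen to show that the stated costs are in the right ballpark:

```python
tokens_per_experiment = 15e9
throughput = 4750        # tokens/sec/GPU, mean across all experiments
gpu_hour_price = 3.5     # USD per H100-hour; ASSUMED, not from the talk

gpu_hours = tokens_per_experiment / throughput / 3600   # ~877 GPU-hours
cost = gpu_hours * gpu_hour_price                       # ~$3.1k per experiment
total = 60_000 * gpu_hour_price                         # ~$210k for ~60k GPU-hours
print(f"{gpu_hours:.0f} GPU-hours, ${cost:,.0f}/experiment, ${total:,.0f} total")
```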

You need a scalable, fault-tolerant pipeline.

DataTrove + vLLM

DataTrove orchestrates the pipeline. vLLM serves the model with optimized batching and prefix caching.

Throughput Optimization

18 models benchmarked on H100 GPUs. Two tiers of optimization.

Cost vs. Performance

Small models + good prompts dominate the Pareto frontier.

Invest in prompt design, not model size.

The FinePhrase Recipe

📄 Source Data: web text (even low-quality)
+
📝 Structured Prompt: Math / Table / FAQ / Tutorial
+
🤖 SmolLM2-1.7B: small, fast, diverse outputs
=
FinePhrase: best synthetic pretraining data

Mixed with high-quality original data (e.g., FineWeb-Edu) for best results.

What Surprised Us

🤷

Typos Don't Matter

REWIRE's original prompt had typos. Fixing them made no measurable difference to downstream performance.

📊

Proxy Scores Lie

Edu-score and DCLM-score do not reliably predict downstream performance. You must train and evaluate.

🎲

Messier Is Better

Varied, inconsistent outputs from SmolLM2 beat Qwen3's polished, template-locked outputs every time.

Everything Is Open

  • All prompts, configs, and pipeline code
  • Generated datasets on the Hugging Face Hub
  • Throughput benchmarks for 18 models
  • Blog post with interactive charts

Future directions:

  • Diffusion LMs for faster inference
  • Scaling to more data (ablations trained on only 21B tokens)
  • Mixing ratio: how little synthetic data can you get away with?
  • Best-of-N filtering on synthetic outputs

Thank You

Questions?

Joel Niklaus

Stay tuned for the blog post with many more details.