The Synthetic Data Playbook


How to Cook Better Training Data for LLMs


SE 26

The Data Black Box

Frontier labs (OpenAI, Google, Anthropic) don't disclose how they build their training data.

Neither do the Chinese labs (DeepSeek, Qwen).

Training data is the most important ingredient in building an LLM, yet the recipes are kept secret.

Digital Sovereignty

If you can't build the data, you can't build the model.
If you can't build the model, you depend on those who can.

This work puts the knowledge out in the open for everyone: governments, universities, startups, and individuals.

LLMs: What's Under the Hood

You use these every day: ChatGPT, Copilot, Claude.

Under the hood: a giant function that takes tokens in and predicts tokens out.

Trained by reading billions of web pages, learning to predict the next word.

Data quality defines model quality.

Input text → LLM (billions of parameters) → Output text
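The "giant function" above can be illustrated with a toy stand-in: a bigram model that, like an LLM, takes tokens in and predicts the next token out, just with frequency counts instead of billions of parameters. Everything here is purely illustrative.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus: str):
    """Count which token follows which in a tiny whitespace-tokenized corpus."""
    tokens = corpus.split()
    follows = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        follows[a][b] += 1
    return follows

def predict_next(follows, token: str) -> str:
    """Predict the most frequent successor of `token`."""
    return follows[token].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat the cat ran")
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once -> cat
```

An LLM replaces the counting table with a learned function over long contexts, which is exactly why its training data matters so much.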

You Start With the Entire Internet...

...and throw away 98.6% of it

DCLM: 240T tokens from Common Crawl → 1.4% survives as DCLM-Baseline
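As a quick sanity check on the numbers above (assuming the 1.4% survival rate applies to the full 240T-token pool):

```python
total_tokens = 240e12    # Common Crawl pool fed into the DCLM pipeline
survival_rate = 0.014    # 1.4% kept as DCLM-Baseline

kept = total_tokens * survival_rate          # ~3.36T tokens survive
discarded_pct = (1 - survival_rate) * 100    # ~98.6% thrown away
print(f"kept: {kept / 1e12:.2f}T tokens, discarded: {discarded_pct:.1f}%")
```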

Rewrite Instead of Filter

Raw Web Text

★★★ BeSt DeAls!!!
Photosynthesis is the process by wich plants convert sunlit into energy. It occurs in the chloroplasts
Click here for more → → →
© 2019 AllScienceInfo.biz
Carbon dioxide and water are transformed into glucose and oxygen... [AD] [AD] [POPUP]

LLM-Rewritten FAQ

Q: What is photosynthesis?
A: Photosynthesis is the process by which plants convert sunlight into chemical energy. It occurs in organelles called chloroplasts.

Q: What are the inputs and outputs?
A: Plants take in carbon dioxide (CO₂) and water (H₂O), and using light energy, produce glucose (C₆H₁₂O₆) and oxygen (O₂).

Same knowledge, better packaging.
You keep 100% of your data instead of discarding nearly 99% of it.
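The "rewrite instead of filter" step can be sketched as a prompt-construction function. The template wording and names here are illustrative, not the actual FinePhrase prompts; a real pipeline would send the resulting prompt to an LLM server.

```python
# Hypothetical FAQ-rewriting template; shows the shape of the idea only.
FAQ_TEMPLATE = (
    "Rewrite the following web text as a question-and-answer FAQ.\n"
    "Keep all factual content; drop ads, links, and boilerplate.\n\n"
    "Text:\n{source}\n\nFAQ:"
)

def build_rewrite_prompt(source: str) -> str:
    """Wrap raw web text in a structured rewriting instruction."""
    return FAQ_TEMPLATE.format(source=source.strip())

prompt = build_rewrite_prompt("★★★ BeSt DeAls!!! Photosynthesis is the process...")
print(prompt.splitlines()[0])  # the instruction line
```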

What's the Best Recipe?

Three knobs to tune: source data, prompt strategy, and generator model.

  • 70+ experiments
  • 1T+ tokens generated
  • 60k+ GPU hours

Our Integration Test Suite

For each experiment, we:

  • Train a 1.2B parameter model from scratch
  • Feed it 20B tokens of synthetic and original data
  • Test on 12 benchmarks (reading, math, reasoning, knowledge...)
  • Compare against curated web datasets as baselines

This is expensive, so we tried cheaper proxies:

  • DCLM/Edu scores (used for filtering pretraining data)
  • Smaller training runs

None correlated well enough.

No shortcuts: you must train and evaluate to know if your data is good.

FinePhrase Wins

Our best synthetic recipe outperforms all tested baselines, including curated web data.

Let's unpack how.

Prompt Design Is the #1 Lever

Structured prompts beat everything:

  • Math reformatting
  • Table extraction
  • FAQ generation
  • Tutorial rewriting

These beat curated web data and all prior synthetic baselines.

The prompt matters more than the model or the source data.
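The four structured strategies can be sketched as a family of prompt templates. The wording below is a guess at the general shape, not the released prompts:

```python
# Hypothetical templates for the four structured strategies; the real
# prompts ship with the project, these only illustrate the pattern.
STRATEGIES = {
    "math":     "Reformat any mathematical content in the text below as "
                "clean problem/solution pairs.\n\n{source}",
    "table":    "Extract the facts in the text below into a table.\n\n{source}",
    "faq":      "Rewrite the text below as a question-and-answer FAQ.\n\n{source}",
    "tutorial": "Rewrite the text below as a step-by-step tutorial.\n\n{source}",
}

def make_prompt(strategy: str, source: str) -> str:
    """Instantiate one structured strategy for a given source document."""
    return STRATEGIES[strategy].format(source=source)

print(sorted(STRATEGIES))  # ['faq', 'math', 'table', 'tutorial']
```

Sweeping a dictionary like this is also how you would ablate the prompt axis independently of the generator model and source data.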

Smol Models Are Enough

1B-scale rephrasers match 4B, 12B, and 27B model performance.

SmolLM2-1.7B beats Qwen, Gemma, Llama, Falcon, and Granite.

And it's much faster:

  • 3.0x faster than Gemma-3-12B
    (9,220 vs 3,046 tps/gpu)
  • 5.3x faster than Gemma-3-27B
    (9,220 vs 1,724 tps/gpu)

Better quality and faster inference.
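The speedup factors follow directly from the measured throughputs:

```python
smollm_tps = 9220   # SmolLM2-1.7B, tokens/sec/GPU on H100
gemma12_tps = 3046  # Gemma-3-12B
gemma27_tps = 1724  # Gemma-3-27B

print(f"{smollm_tps / gemma12_tps:.1f}x faster than Gemma-3-12B")  # 3.0x
print(f"{smollm_tps / gemma27_tps:.1f}x faster than Gemma-3-27B")  # 5.3x
```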

Diversity Beats Consistency

Messy beats polished.

SmolLM2's varied, inconsistent outputs outperform Qwen3's template-locked, clean outputs.

Synthetic-only fails.
You must mix synthetic data with original web data.

The mix-in dataset matters as much as the synthetic data itself.

Template Collapse

Qwen3 Math outputs:

115 / 1000 samples start with the exact same sentence

SmolLM2 Math outputs:

Highly varied formatting and structure

Diversity beats consistency for pretraining.
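One crude way to quantify template collapse, sketched here as an assumption rather than the project's actual metric, is the fraction of samples sharing the single most common opening sentence:

```python
from collections import Counter

def template_collapse_rate(samples):
    """Fraction of samples that share the most common opening sentence.

    A high value suggests the generator is locked into one template,
    as with the 115/1000 identical openings in the Qwen3 math outputs.
    """
    openings = [s.split(".")[0].strip() for s in samples]
    _, count = Counter(openings).most_common(1)[0]
    return count / len(samples)

collapsed = (["We need to find x. Step 1..."] * 115
             + [f"Unique opening {i}. ..." for i in range(885)])
print(template_collapse_rate(collapsed))  # 0.115
```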

What We Found

  • Prompt design is the #1 lever. Structured formats (Math, Table, FAQ, Tutorial) outperform everything.
  • 1B models suffice. SmolLM2-1.7B is the best rephraser across the board.
  • Mix original data in. Synthetic-only fails. The mix-in dataset matters.
  • Diversity wins over polish. Varied, messy outputs beat clean, template-locked ones.

How Do You Rephrase 1T Tokens?

Each experiment generates ~15B tokens.

70+ experiments = 1T+ tokens of LLM output.

At ~4,750 tokens/sec/GPU (mean across all experiments):

  • ~880 GPU-hours per experiment
  • ~$3k cloud cost per experiment
  • ~$215k total compute budget
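These figures follow from the generation volume and throughput. The hourly GPU price below is an assumption (a plausible H100 cloud rate, not a number from the talk), chosen to show that the stated costs are in the right ballpark:

```python
tokens_per_experiment = 15e9
throughput = 4750        # tokens/sec/GPU, mean across all experiments
gpu_hour_price = 3.5     # USD per H100-hour; ASSUMED, not from the talk

gpu_hours = tokens_per_experiment / throughput / 3600   # ~877 GPU-hours
cost = gpu_hours * gpu_hour_price                       # ~$3.1k per experiment
total = 60_000 * gpu_hour_price                         # ~$210k for ~60k GPU-hours
print(f"{gpu_hours:.0f} GPU-hours, ${cost:,.0f}/experiment, ${total:,.0f} total")
```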

You need a scalable, fault-tolerant pipeline.

DataTrove + vLLM

DataTrove orchestrates the pipeline. vLLM serves the model with optimized batching and prefix caching.

Throughput Optimization

18 models benchmarked on H100 GPUs. Two tiers of optimization.

Cost vs. Performance

Small models + good prompts dominate the Pareto frontier.

Invest in prompt design, not model size.

The FinePhrase Recipe

📄 Source Data: web text (even low-quality)
+
📝 Structured Prompt: Math / Table / FAQ / Tutorial
+
🤖 SmolLM2-1.7B: small, fast, diverse outputs
=
FinePhrase: best synthetic pretraining data

Mixed with high-quality original data (e.g., FineWeb-Edu) for best results.

What Surprised Us

🤷

Typos Don't Matter

REWIRE's original prompt had typos. Fixing them made no measurable difference to downstream performance.

📊

Proxy Scores Lie

Edu-score and DCLM-score do not reliably predict downstream performance. You must train and evaluate.

🎲

Messier Is Better

Varied, inconsistent outputs from SmolLM2 beat Qwen3's polished, template-locked outputs every time.

Everything Is Open

  • All prompts, configs, and pipeline code
  • Generated datasets on the Hugging Face Hub
  • Throughput benchmarks for 18 models
  • Blog post with interactive charts

Future directions:

  • Diffusion LMs for faster inference
  • Scaling to more data (ablations trained on only 21B tokens)
  • Mixing ratio: how little synthetic data can you get away with?
  • Best-of-N filtering on synthetic outputs

Thank You

Questions?

Joel Niklaus

Stay tuned for the blog post with many more details.