SE 26
Why This Matters
Frontier labs (OpenAI, Google, Anthropic) don't disclose how they build their training data.
Neither do Chinese labs such as DeepSeek and Qwen.
Training data is the most important ingredient in building an LLM, yet the recipes are kept secret.
Digital Sovereignty
If you can't build the data, you can't build the model.
If you can't build the model, you depend on those who can.
This work puts the knowledge out in the open for everyone: governments, universities, startups, and individuals.
Quick Recap
You use these every day: ChatGPT, Copilot, Claude.
Under the hood: a giant function that takes tokens in and predicts tokens out.
Trained by reading billions of web pages, learning to predict the next word.
Data quality defines model quality.
The Problem
DCLM: 240T tokens from Common Crawl → 1.4% survives as DCLM-Baseline
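To make that survival rate concrete, the arithmetic is a one-liner (a quick sketch using the numbers above):

```python
# DCLM filtering arithmetic, using the figures from the slide above.
raw_tokens = 240e12     # 240T tokens from Common Crawl
survival_rate = 0.014   # 1.4% survives as DCLM-Baseline

kept = raw_tokens * survival_rate
discarded = raw_tokens - kept

print(f"kept:      {kept / 1e12:.2f}T tokens")       # 3.36T
print(f"discarded: {discarded / 1e12:.2f}T tokens")  # 236.64T
```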
The Idea
★★★ BeSt DeAls!!!
Photosynthesis is the process by wich plants convert sunlit into energy.
It occurs in the chloroplasts
Click here for more → → →
© 2019 AllScienceInfo.biz
Carbon dioxide and water are transformed into glucose and oxygen...
[AD] [AD] [POPUP]
Q: What is photosynthesis?
A: Photosynthesis is the process by which plants convert sunlight into chemical energy.
It occurs in organelles called chloroplasts.
Q: What are the inputs and outputs?
A: Plants take in carbon dioxide (CO₂) and water (H₂O), and using light energy,
produce glucose (C₆H₁₂O₆) and oxygen (O₂).
Same knowledge, better packaging.
You keep 100% of your data instead of discarding over 98% of it.
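The rewriting step can be sketched as a simple prompt template; the wording here is illustrative, not the actual recipe used in the experiments:

```python
# Illustrative rewriting prompt: turn noisy web text into clean Q&A.
# The prompt wording below is a made-up sketch, not the real recipe.
REWRITE_PROMPT = """\
Rewrite the following web page as a series of clear question-and-answer
pairs. Keep all factual content; drop ads, navigation, and boilerplate.

Web page:
{document}

Q&A version:"""

def build_rewrite_prompt(document: str) -> str:
    """Fill the template with one noisy web document."""
    return REWRITE_PROMPT.format(document=document)

noisy = "★★★ BeSt DeAls!!! Photosynthesis is the process by wich plants..."
print(build_rewrite_prompt(noisy))
```

The filled prompt is then sent to the generator model; its completion becomes a synthetic pretraining document.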
Our Research
Three knobs to tune: source data, prompt strategy, and generator model.
Methodology
For each experiment, we:
Training and evaluating every candidate dataset is expensive, so we tried cheaper proxies:
None correlated well enough.
No shortcuts: you must train and evaluate to know if your data is good.
Spoiler
Our best synthetic recipe outperforms all tested baselines, including curated web data.
Let's unpack how.
Finding #1
Structured prompts beat everything:
These beat curated web data and all prior synthetic baselines.
The prompt matters more than the model or the source data.
Finding #2
A 1B-parameter generator matches the performance of 4B, 12B, and 27B models.
SmolLM2-1.7B beats Qwen, Gemma, Llama, Falcon, and Granite.
And it's much faster:
Better quality and faster inference.
Finding #3
Messy beats polished.
SmolLM2's varied, inconsistent outputs outperform Qwen3's template-locked, clean outputs.
Synthetic-only fails.
You must mix synthetic data with original web data.
The mix-in dataset matters as much as the synthetic data itself.
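A minimal sketch of such a mix: interleave synthetic and web documents with a tunable ratio. The 0.5 default and the helper name are illustrative; the right ratio must be found empirically.

```python
import random

def mixed_stream(synthetic_docs, web_docs, synthetic_ratio=0.5, seed=0):
    """Interleave synthetic and web documents into one pretraining stream.

    synthetic_ratio is a knob to tune empirically; 0.5 is illustrative,
    not the ratio used in the experiments.
    """
    rng = random.Random(seed)
    syn, web = iter(synthetic_docs), iter(web_docs)
    while True:
        source = syn if rng.random() < synthetic_ratio else web
        try:
            yield next(source)
        except StopIteration:  # stop when either pool runs dry
            return

docs = list(mixed_stream(["SYN"] * 50, ["WEB"] * 50, synthetic_ratio=0.3))
```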
Qwen3 Math outputs:
115 / 1000 samples start with the exact same sentence
SmolLM2 Math outputs:
Highly varied formatting and structure
Diversity beats consistency for pretraining.
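The 115/1000 figure above can be measured with a check like this (a sketch: the sentence split is naive and the sample strings are made up for illustration):

```python
from collections import Counter

def first_sentence(text: str) -> str:
    """Naive split: everything up to the first period."""
    return text.split(".")[0].strip()

def most_common_opener(samples):
    """Return the most frequent opening sentence and its count."""
    counts = Counter(first_sentence(s) for s in samples)
    return counts.most_common(1)[0]

# Fabricated samples mimicking a template-locked generator.
samples = (
    ["To solve this problem, we start with the given equation. ..."] * 115
    + [f"Sample {i} with a different opener. ..." for i in range(885)]
)
opener, count = most_common_opener(samples)
print(count, "/", len(samples))  # 115 / 1000
```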
Summary
Infrastructure
Each experiment generates ~15B tokens.
70+ experiments = 1T+ tokens of LLM output.
At ~4,750 tokens/sec/GPU (mean across all experiments):
You need a scalable, fault-tolerant pipeline.
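Back-of-the-envelope cost of that generation volume, from the throughput figure above:

```python
# GPU time needed to generate the full experiment suite.
tokens_needed = 1e12           # 1T+ tokens across 70+ experiments
tokens_per_sec_per_gpu = 4750  # mean generation speed from the slide

gpu_seconds = tokens_needed / tokens_per_sec_per_gpu
gpu_hours = gpu_seconds / 3600
print(f"{gpu_hours:,.0f} GPU-hours")  # roughly 58,000 GPU-hours
```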
Infrastructure
18 models benchmarked on H100 GPUs. Two tiers of optimization.
Infrastructure
Small models + good prompts dominate the Pareto frontier.
Invest in prompt design, not model size.
Conclusion
Mixed with high-quality original data (e.g., FineWeb-Edu) for best results.
Conclusion
REWIRE's original prompt had typos. Fixing them made no measurable difference to downstream performance.
Edu-score and DCLM-score do not reliably predict downstream performance. You must train and evaluate.
Varied, inconsistent outputs from SmolLM2 beat Qwen3's polished, template-locked outputs every time.
Open Source
Future directions:
Questions?
Stay tuned for the blog post with many more details.