Distillation Ain't What It Used To Be

Recently I have been seeing a number of takes claiming that most Chinese AI progress depends on distillation from closed US models, especially those from OpenAI and Anthropic, and that therefore, if access to these models were cut off, Chinese progress would cease or substantially slow down. I strongly disagree with this, and I think that broadly the effect of distillation from the frontier has been oversold.

This is not to say distillation is not helpful. Training on synthetic data, especially from frontier models, provides a fantastic and relatively sample-efficient prior for cold-starting reasoning performance, especially when included early in mid-training or even in mid-pretraining¹. However, it is far from being the primary driver of progress for Chinese labs and I don’t think it would hamper them massively if it stopped.

Firstly, there is a lot of moralizing about this, but I don’t think it is really correct to think of the Chinese labs as doing something qualitatively different from Western labs. The Western labs are also distilling; they are just distilling from humans instead. This is what I discussed in my data progress blog, and it is also evident from the success of Western data provider companies like Scale, Surge, and Mercor. These companies earn billions of dollars by paying human experts in math, the sciences, medicine, finance, etc. to solve tasks and write down their reasoning. The true pipeline is that Western labs distill from humans (via Scale, Mercor, etc.) and the Chinese labs distill from Western labs (and each other).

However, this isn’t some fundamental constraint on Chinese labs. There are plenty of extremely intelligent experts in China who could provide data domestically at the same level as Western data brokers. I don’t know enough about the Chinese data ecosystem to say whether this is already a major sector, but very clearly it could become one quickly if it were needed. My suspicion is that the Chinese labs prefer to distill from models instead of humans (as we also do at Zyphra) because it is vastly cheaper and more token efficient while retaining much of the same quality. Wrangling humans for this is just expensive, slow, and annoying, so you avoid it if you can.

Secondly, distillation data is mostly just used as cold-start data for mid-training and SFT. The real capability leaps come from the RLVR (Reinforcement Learning from Verifiable Rewards) phases. RLVR explicitly does not include training on distilled data from other models. In fact, doing so would be harmful because that data is off-policy. Rather, we train on our model’s own rollouts. RLVR is an amazing hill-climbing engine which can build on top of a relatively weak signal and maximize it in a shockingly small number of steps. This means that cutting off access to Western models will not particularly harm the Chinese RLVR efforts.
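To make the on-policy point concrete, here is a toy sketch of the RLVR loop. Everything in it is an illustrative assumption rather than any lab's actual recipe: the "policy" is just a softmax over ten candidate answers, the verifier is an exact-match check, and the update is plain REINFORCE with a mean-reward baseline. The only point is that the training signal comes entirely from the model's own rollouts plus a checkable reward, with no data from another model anywhere in the loop.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def verify(answer, target):
    """Verifiable reward: 1.0 if the answer is exactly right, else 0.0."""
    return 1.0 if answer == target else 0.0

def rlvr_toy(num_answers=10, target=7, steps=200, rollouts=16, lr=1.0, seed=0):
    """Toy RLVR: REINFORCE with a mean baseline on the policy's own samples.

    All names and hyperparameters here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_answers          # uniform initial policy
    p_start = softmax(logits)[target]
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: every rollout is sampled from the current policy.
        samples = rng.choices(range(num_answers), weights=probs, k=rollouts)
        rewards = [verify(a, target) for a in samples]
        baseline = sum(rewards) / rollouts
        grad = [0.0] * num_answers
        for a, r in zip(samples, rewards):
            advantage = r - baseline
            # Gradient of log softmax(a) w.r.t. logit k is (1[k == a] - probs[k]).
            for k in range(num_answers):
                grad[k] += advantage * ((1.0 if k == a else 0.0) - probs[k])
        logits = [l + lr * g / rollouts for l, g in zip(logits, grad)]
    return p_start, softmax(logits)[target]

p_start, p_end = rlvr_toy()
print(f"P(correct) before: {p_start:.2f}, after: {p_end:.2f}")
```

Note that a batch with no correct rollouts has zero advantage everywhere and produces no update at all, which is the toy version of needing some initial traction before the hill-climbing can start.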

Now it is true that RLVR requires some initial traction on the problem before the hill-climbing can really commence. This is where the cold-start data comes in handy, but it is not required. In the worst case, iterative self-SFT via simple best-of-N rejection sampling (sample N rollouts, keep the ones that pass the verifier, fine-tune on them, repeat) works perfectly fine, although it is a bit compute-inefficient.
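As a rough illustration of that fallback, the sketch below runs best-of-N rejection sampling followed by a self-SFT step on a toy problem. The setup is assumed purely for illustration: the "model" is a softmax over ten candidate answers, the verifier is an exact-match check, and the "SFT" step is a single gradient-ascent step on the log-likelihood of the accepted samples. The function name and hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_sft_toy(num_answers=10, target=3, rounds=30, n=8, lr=0.5, seed=0):
    """Iterative self-SFT: sample N, keep verified samples, fit to them, repeat.

    Returns P(correct answer) before training and after each round.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_answers          # uniform initial model
    history = [softmax(logits)[target]]
    for _ in range(rounds):
        probs = softmax(logits)
        # Best-of-N: draw N samples from the model's own distribution.
        samples = rng.choices(range(num_answers), weights=probs, k=n)
        # Rejection step: keep only samples that pass the verifier.
        accepted = [a for a in samples if a == target]
        if accepted:
            # SFT step: ascend the mean log-likelihood of the accepted samples.
            for k in range(num_answers):
                freq = sum(1 for a in accepted if a == k) / len(accepted)
                logits[k] += lr * (freq - probs[k])
        # If nothing was accepted, the round's compute is simply wasted.
        history.append(softmax(logits)[target])
    return history

history = self_sft_toy()
print(f"P(correct) before: {history[0]:.2f}, after: {history[-1]:.2f}")
```

The compute inefficiency shows up directly: rejected samples are thrown away, and early rounds with no accepted sample do nothing, which is exactly the cost that cold-start data lets you skip.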

Moreover, getting non-zero initial signal is rarely the actual bottleneck in RLVR. Rather it is things like the diversity and robustness (to reward hacking) of environments, the noise in the verifier, handling async off-policy bias, and so on. These are all extremely serious challenges necessitating clever thinking and good engineering, but these are all capabilities that you build out yourself rather than relying on cold-start reasoning data.

What is particularly frustrating is that this has been known literally since the beginning of reasoning models. In the original DeepSeek-R1 paper, they explicitly trained a no-cold-start reasoning model from a pure pretrained base using solely RLVR. Now this wasn’t as efficient and was slightly wonky compared to starting with reasoning traces, but it is far from impossible; it is already known exactly how to do this. Microsoft’s recent model also made a point of doing this exact self-origination of reasoning from scratch (although I am moderately sceptical that they managed to remove all reasoning traces from their corpus, simply because such traces are now plastered all over the internet).

¹ Although I have no non-public information here, I strongly suspect that Western labs also train on synthetic data (probably from their own prior generations of models) in pretraining and mid-training. In the GPT-5 release, OpenAI made a big deal about how they used synthetic data for pretraining, and I suspect the other labs do it too.