Reasoning Distillation Is Not Compression Alone

2025-02-28

Why reasoning distillation remains difficult

It is tempting to describe reasoning distillation as a straightforward transfer problem. A strong model produces chain-of-thought trajectories, a smaller model is trained on those trajectories, and the result is expected to inherit some fraction of the teacher's reasoning ability. In practice, this picture is too simple.
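The naive recipe described above can be made concrete with a toy sketch. Everything here is illustrative: the record fields and the `build_sft_pairs` helper are hypothetical names, not any particular library's API, and a real pipeline would tokenize and batch these pairs for fine-tuning.

```python
# A minimal sketch of the "straightforward transfer" view: reasoning
# distillation as supervised fine-tuning on teacher traces. The record
# schema and helper name are illustrative assumptions.

def build_sft_pairs(teacher_records):
    """Turn teacher chain-of-thought records into (prompt, target) pairs.

    The naive recipe trains the student to reproduce the teacher's
    full trace plus final answer, verbatim.
    """
    pairs = []
    for rec in teacher_records:
        prompt = rec["problem"]
        target = rec["trace"] + "\nAnswer: " + rec["answer"]
        pairs.append((prompt, target))
    return pairs

records = [
    {"problem": "What is 12 * 9?",
     "trace": "12 * 9 = 12 * 10 - 12 = 120 - 12 = 108.",
     "answer": "108"},
]
pairs = build_sft_pairs(records)
```

The rest of this note is essentially an argument about why this recipe, taken literally, under-specifies what the student should learn.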

The difficulty is not only that the student model is smaller. The deeper issue is that reasoning is not a single object that can be copied intact. It is a layered behavior involving representation, decomposition, search, error correction, and the ability to remain coherent across multiple intermediate steps. When we distill reasoning, we are deciding which of these layers the student should imitate and which it should reconstruct internally.

That is why reasoning distillation often feels unstable. A student may imitate the surface form of a chain-of-thought without acquiring the underlying capacity that made the teacher's reasoning useful in the first place.

The real design question: what is being transferred?

In many discussions, the teacher output is treated as a complete package. But teacher traces can serve several different roles:

  1. They can provide task decomposition, showing how a complex problem may be broken into smaller steps.
  2. They can provide solution structure, revealing the order in which evidence or sub-results should be combined.
  3. They can provide error signals, making visible where a wrong partial conclusion causes downstream collapse.
  4. They can provide language patterns, which are useful for readability but not always essential for competence.

These roles are not equivalent. A student that learns to reproduce stylistic reasoning markers may still fail to solve the task robustly. Conversely, a student that learns stable internal decomposition may succeed even if its generated rationale is shorter or less polished than the teacher's.

The practical implication is that reasoning distillation should be treated as a representation problem, not merely a supervised imitation problem.
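One way to act on this is to separate the roles before training. The sketch below is a deliberately crude heuristic, not a validated taxonomy: the marker list and the digits-or-equals test for "operational content" are assumptions chosen for illustration only.

```python
# Hedged sketch: strip purely stylistic reasoning markers from a
# teacher trace, keeping lines that carry operational content.
# STYLE_MARKERS and the keep/drop rule are illustrative heuristics.

STYLE_MARKERS = ("let's think step by step", "okay,", "hmm,", "great question")

def strip_style(trace_lines):
    """Drop lines that are only stylistic filler; keep lines with
    operational content (here approximated as digits or '=')."""
    kept = []
    for line in trace_lines:
        lowered = line.strip().lower()
        stylistic = any(lowered.startswith(m) for m in STYLE_MARKERS)
        operational = any(c.isdigit() for c in line) or "=" in line
        if stylistic and not operational:
            continue  # filler: discard before building supervision
        kept.append(line)
    return kept

lines = ["Let's think step by step.", "12 * 9 = 108.", "Hmm, that looks right."]
kept = strip_style(lines)
```

The point is not that these heuristics are good, but that the filtering step forces an explicit decision about which role of the trace the student is being asked to imitate.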

Granularity matters because supervision shapes the latent program

One of the most consequential design choices is granularity. If the teacher provides only a final answer, the student is forced to infer the hidden structure of the solution on its own. If the teacher provides a very long explanation, the student may receive more information, but also more noise, redundancy, or stylistic clutter.

The right granularity is task-dependent. Some tasks benefit from compact but decisive intermediate states. Others require a longer sequence of explicit transformations. What matters is not the raw number of steps but whether the supervision exposes the critical decision points in the reasoning process.

This is one reason overly verbose chains of thought can be counterproductive. Longer traces do not always provide better supervision; sometimes they simply make the student spend capacity on the linguistic surface of reasoning instead of the operational structure that actually solves the problem.
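The granularity choice can be made explicit as a sketch. The `critical` flags below are assumed to be annotated upstream (for instance by the teacher or a verifier); the step schema and function name are hypothetical.

```python
# Sketch: deriving supervision targets at three granularities from one
# structured teacher trace. Step annotations are assumed, not derived.

def make_target(steps, answer, granularity):
    """Build a training target exposing more or fewer decision points."""
    if granularity == "answer_only":
        return answer
    if granularity == "key_steps":
        core = [s["text"] for s in steps if s["critical"]]
        return "\n".join(core + [answer])
    if granularity == "full_trace":
        return "\n".join([s["text"] for s in steps] + [answer])
    raise ValueError(f"unknown granularity: {granularity}")

steps = [
    {"text": "Rewrite 12 * 9 as 12 * (10 - 1).", "critical": True},
    {"text": "So we just distribute the product.", "critical": False},
    {"text": "12 * 10 - 12 = 108.", "critical": True},
]
```

The "key_steps" variant is the interesting one: it keeps the critical decision points while dropping connective commentary, which is roughly what the paragraph above argues good supervision should do.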

Distillation should optimize for transfer, not imitation fidelity

A common failure mode in small-model distillation is to evaluate success by surface similarity to the teacher's output. If the student produces reasoning that reads close to the teacher's explanation, we may conclude that the method worked. But this is often the wrong criterion.

The relevant question is whether the student can transfer the learned structure to fresh tasks, varied prompts, or slightly different forms of difficulty. If it cannot, then the distilled reasoning was too brittle or too literal.

This suggests a more demanding evaluation standard:

  • Does the student maintain performance when the prompt style changes?
  • Does it generalize to related benchmarks rather than only the supervised format?
  • Does the reasoning improve answer quality, or merely improve explanation length?

These criteria are closer to what we actually care about: useful, portable reasoning competence.
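The first of these criteria can be sketched as a concrete check: score the same student under several prompt rephrasings and require the accuracy spread to stay within a tolerance band. Everything here is a toy stand-in (the `student` callable, the stylers, the tolerance), not a benchmark harness.

```python
import re

# Sketch of a transfer-oriented check: a student "passes" only if its
# accuracy is stable across prompt-style variants. All names are
# illustrative assumptions, not an existing evaluation API.

def transfer_score(student, problems, stylers, tol=0.1):
    """Return per-style accuracy and whether the spread stays within tol."""
    accs = {}
    for name, style in stylers.items():
        correct = sum(student(style(p["q"])) == p["a"] for p in problems)
        accs[name] = correct / len(problems)
    robust = max(accs.values()) - min(accs.values()) <= tol
    return accs, robust

def toy_student(prompt):
    """Toy stand-in: solves 'X+Y' wherever it appears in the prompt."""
    m = re.search(r"(\d+)\+(\d+)", prompt)
    return str(int(m.group(1)) + int(m.group(2)))

problems = [{"q": "2+3", "a": "5"}, {"q": "4+4", "a": "8"}]
stylers = {
    "plain": lambda q: q,
    "verbose": lambda q: "Please compute the following: " + q,
}
accs, robust = transfer_score(toy_student, problems, stylers)
```

A student that only memorized one prompt format would pass the "plain" styler and fail the "verbose" one, which is exactly the brittleness the criteria above are meant to expose.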

Why this matters beyond benchmark performance

Reasoning distillation is not only about obtaining smaller models that perform well on math or logic tasks. It is part of a broader effort to make capable systems more efficient, reproducible, and deployable. Smaller models are easier to inspect, cheaper to run, and more realistic for many downstream environments.

But if we want compact models that retain high-value reasoning behavior, we need to stop thinking of distillation as model shrinking plus more labels. It is a question about how to shape the student's internal organization.

That is why I think the most useful future work in this area will focus less on copying full teacher traces verbatim and more on identifying the smallest forms of supervision that preserve structure, adaptability, and transfer.

Takeaway

Reasoning distillation works best when we ask the right question. The question is not simply: How do we make a small model repeat what a large model said? It is: What supervision helps a smaller model build a reasoning process of its own?

That distinction is easy to miss, but it changes almost every design decision that follows.