Technical Report Interpretation for DeepSeek
Will DeepSeek’s reduced training cost also lead to lower overall compute demand?
I read the DeepSeek R1 paper and jotted down some notes and reflections.
Looking at history, we already know: the cost per token for GPT-4-level intelligence dropped by 100–1000× within about 1.5 years, a pace with no real precedent in comparable systems (see figure below).
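To put that 100–1000×-in-1.5-years figure in perspective, here is the implied yearly decline rate. The numbers are the rough estimates quoted above, not measured data, and the assumption of a constant compounding rate is a simplification:

```python
def annual_decline(total_drop: float, years: float) -> float:
    """Constant yearly factor that compounds to `total_drop` over `years`."""
    return total_drop ** (1 / years)

# Cost per token falling 100-1000x over ~1.5 years implies a yearly factor of:
low = annual_decline(100, 1.5)    # ~21.5x cheaper per year
high = annual_decline(1000, 1.5)  # ~100x cheaper per year
print(f"{low:.1f}x to {high:.1f}x per year")
```

Even the low end of that range is far steeper than classic hardware cost curves, which is why the trend line matters more than any single model's price tag.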

AI compute inflation is inevitable. The long-term trend is that every decade brings a three-order-of-magnitude jump: software/algorithms improve by 3 orders, hardware improves by 3 orders, and compute efficiency improves by 3 orders.
Starting from the DeepSeek R1 paper itself:
DeepSeek does not rely on test-time scaling, the way ChatGPT-3.5-style systems do, to obtain strong reasoning performance. Instead, it strengthens reasoning during post-training.
A model may look weak at first (e.g. failing a question on the first attempt), but once prompted to “think step-by-step,” a search-like reasoning process kicks in and produces the correct answer. This kind of prompting can yield a large improvement (perhaps 4×). Either way, when evaluated on reasoning benchmarks, the final performance of the two approaches ends up comparable.
Therefore, in terms of reasoning and price-performance ratio, DeepSeek R1’s 27x cost drop is quite reasonable.
DeepSeek-R1 is based on the "DeepSeek-V3-Base" model, which is further trained via RL-based learning.
Its reward mechanism uses rule-based reward systems, rather than a neural reward model (whether process-based or outcome-based).
The reward is divided into two parts:
- Accuracy Reward:
For questions with definite answers, it checks whether the model’s answer is correct.
- Format Reward:
Ensures the model’s responses follow a specific format, e.g. placing the reasoning process between `<think>` and `</think>` tags.
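A minimal sketch of what a rule-based reward along these lines could look like. The `Answer:` extraction and exact-string match are hypothetical simplifications for illustration, not DeepSeek's actual implementation:

```python
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference, else 0.0.
    (Hypothetical extraction: text after a trailing 'Answer:' marker.)"""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion)
    return 1.0 if match and match.group(1).strip() == gold_answer else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # Rule-based: no neural reward model is consulted anywhere.
    return accuracy_reward(completion, gold_answer) + format_reward(completion)

sample = "<think>2 + 2 is 4</think>\nAnswer: 4"
print(total_reward(sample, "4"))  # 2.0
```

The appeal is that both checks are cheap, deterministic, and hard to reward-hack compared to a learned reward model.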
It may be due to the use of rule-based rewards that the model learns to produce its own reasoning “aha moments”.
The model begins to exhibit self-generated CoT reasoning chains, along with reflection and even exploration-like behavior. These might represent signs of growing emergent reasoning ability.
DeepSeek-R1 is the first to demonstrate that, without any supervised annotations, an LLM can improve its reasoning via reinforcement learning (RL).
Through this rule-based reward + GRPO, there is no need to manually write or annotate complex CoT data.
The model’s reasoning ability is improved substantially (on the order of 10×) while keeping costs low.
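The "GRPO" mentioned above replaces a learned value network with group-relative normalization: sample several completions per prompt, score them with the rule-based reward, and normalize within the group. A simplified sketch follows; the actual GRPO objective also includes a clipped policy ratio and a KL penalty, both omitted here, and the reward values are illustrative:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each completion's reward, normalized by
    the mean and std of its sampling group (no value network needed)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Eight completions sampled for one prompt, scored by rule-based rewards only.
group = [2.0, 0.0, 1.0, 2.0, 0.0, 0.0, 2.0, 1.0]
print([round(a, 2) for a in grpo_advantages(group)])
```

Because the baseline comes from the group itself, the only supervision signal the whole pipeline needs is the verifiable answer, which is why no hand-annotated CoT data is required.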
Why hasn’t this RL-based method for improving reasoning ability been widely tried before?
The theory was there (probably), but most LLaMA-based models lacked strong reasoning ability.
GSM8K scores remained low, and reasoning wasn't a major differentiator.
Plus, the Hugging Face ecosystem lacked good open-source CoT datasets.
Now that DeepSeek has solved the pre-training phase and cost issues, they can focus on strengthening reasoning post-training.
Another key point is that standard supervised fine-tuning often does not produce genuine reasoning; instead, it encourages models to memorize and pattern-match against the training data.
Only through RL and reward learning can models deeply grasp the underlying rules. RL often moves models in “unexpected but correct” directions—helpful for generalization.
There’s also the question of scalability:
Can DeepSeek-R1’s low-cost training approach scale up to larger base models with more compute?
If it can, we could continue pushing the frontier forward via RL fine-tuning.
From this perspective, DeepSeek R1’s method may well be a scalable way to keep pushing the frontier (staying on track).

If DeepSeek-R1’s low-cost training approach is scalable, then this represents another step forward in the broader sense of the “scaling law” —
a new trajectory that continues to follow the scaling law (on track).

On DeepSeek's impact on compute:
This new round of compute competition is driven by the assumption that scaling law gains haven’t hit a ceiling (confirmed by Mark).
That’s why everyone is rushing in now—this wave is different from the past.
So the real question isn’t whether DeepSeek uses only 1/10 the compute to get similar results.
The key is whether DeepSeek’s method is scalable and can maintain the trajectory of the scaling law.
That’s the real differentiating factor.
As long as the scaling law remains valid, compute capex will continue increasing.

Just like Moore’s Law — as long as the law holds, chip markets grow and R&D costs rise.
DeepSeek’s low-cost training method can be viewed as a continuation of that scaling trajectory:
A new curve that keeps Moore’s Law going by optimizing architecture and cost.
To illustrate: if there’s a breakthrough that lets each chip’s compute density (gates per area) double,
the industry would panic about power and heat.
But if instead the breakthrough lets chip design itself be done more efficiently (at companies like Intel, AMD, or Qualcomm),
that would be a cost innovation: no panic, just transformation.
Back to DeepSeek:
What it provides is a Moore-like cost efficiency gain on the algorithmic and engineering level.
From a macro perspective, AI compute’s ultimate bottleneck is energy and fabrication.
The long-term trend is that every 10 years we expect a 3-order-of-magnitude improvement:
- software/algorithms +3 orders
- hardware +3 orders
- compute efficiency +3 orders
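A quick sanity check on what "3 orders of magnitude per decade" means at a yearly cadence, assuming the trend compounds steadily (an idealization, like Moore's Law itself):

```python
# Three orders of magnitude per decade implies a steady yearly factor of:
yearly = 10 ** (3 / 10)
print(f"~{yearly:.2f}x per year")  # ~2x per year, close to classic Moore's Law

# Stacking the three independent 3-order trends listed above:
combined = 10 ** 3 * 10 ** 3 * 10 ** 3
print(f"{combined:.0e} overall per decade")
```

So each individual trend is roughly a doubling per year, and multiplying the three together gives nine orders of magnitude of effective capability per decade.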
Now we’re at the GPT-4 level, and the cost is astronomical—likely above $10M per model.
Even fine-tuning costs are extremely high.
If scalable methods like DeepSeek’s can lower the cost of training models of this scale,
and continue improving model performance through cost-efficient techniques,
we could re-enter a “low-cost + high-efficiency” curve that unlocks new capabilities.
This could spark more innovation by freeing up resources, leading to a second wave of scaling law engineering breakthroughs.
Why was the AI community so enthusiastic about DeepSeek?
Not just because it was “cheap” — more importantly, it proved a new, more efficient path for training strong AI models.
Think about it:
If DeepSeek’s approach is scalable, then this path could extend all the way to the limits of the frontier,
or even enable AGI/ASI-level reasoning under much lower compute budgets.