2512.02556 / LLM

DeepSeek-V3: Scaling Inference-Time Compute

DeepSeek-AI

Core Insight

A lightweight "lightning indexer" can learn which tokens matter for attention, reducing O(L²) to O(Lk) complexity while preserving quality. Combined with allocating >10% of pre-training compute to post-training RL, this unlocks frontier-level reasoning in open models.

Why Previous Approaches Failed

Three structural problems held open-source models back from frontier performance:

1. Quadratic Attention Bottleneck

Standard attention computes all L² pairwise interactions between tokens. At 128K context length:

  • Dense attention: 128K × 128K = 16 billion operations per layer
  • This becomes prohibitively expensive for both inference and RL training
  • RL requires many rollouts—each rollout at 128K context explodes compute

2. Underinvestment in Post-Training

Open models typically allocate <1% of pre-training compute to RLHF/post-training. This caps the reasoning ceiling—the model learns facts during pre-training but not how to think through hard problems. DeepSeek found that the reasoning gap to frontier models closed dramatically when they 10x'd post-training investment.

3. Reasoning-Tool Disconnect

Models either reason OR use tools, but combining them was broken:

  • Each tool call would discard the reasoning context
  • Model had to re-reason from scratch after each tool response
  • This created redundant computation and lost reasoning chains

The Method

DeepSeek-V3 addresses each failure mode with specific mechanisms:

1. DeepSeek Sparse Attention (DSA)

Instead of attending to all previous tokens, a small "lightning indexer" network learns to score token relevance in real-time:

Lightning Indexer Score
$$I_{t,s} = \sum_{j=1}^{H_I} w_{t,j}^I \cdot \text{ReLU}(q_{t,j}^I \cdot k_s^I)$$

The indexer is a tiny attention mechanism (few heads, FP8 precision) that runs before the main attention. It produces scores for all previous tokens, then selects the top-k (k=2048) for full attention computation.
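A minimal PyTorch sketch of the scoring-and-selection step for a single query position, following the equation above. The tensor names and sizes (H_I = 4 indexer heads, d_I = 64) are illustrative assumptions, and everything runs in FP32 here; the actual indexer uses FP8 and a handful of heads.

```python
import torch

def lightning_indexer_scores(q_idx, k_idx, w_idx):
    """Score every cached token for the current query position t.

    q_idx : (H_I, d_I)  indexer queries q^I_{t,j}, one per indexer head
    k_idx : (S, d_I)    indexer keys k^I_s for the S previous tokens
    w_idx : (H_I,)      learned per-head weights w^I_{t,j}
    Returns a length-S tensor of scores I_{t,s}.
    """
    head_scores = torch.relu(q_idx @ k_idx.T)   # ReLU(q^I_{t,j} · k^I_s) -> (H_I, S)
    return w_idx @ head_scores                  # weighted sum over heads -> (S,)

def select_topk(scores, k=2048):
    """Pick the k highest-scoring past tokens; only these reach the main attention."""
    k = min(k, scores.shape[-1])
    return torch.topk(scores, k).indices        # positions to gather from the KV cache

# Toy usage: 8192 cached tokens, 4 indexer heads, 64-dim indexer queries/keys.
S, H_I, d_I = 8192, 4, 64
q_idx, k_idx, w_idx = torch.randn(H_I, d_I), torch.randn(S, d_I), torch.randn(H_I)
kept = select_topk(lightning_indexer_scores(q_idx, k_idx, w_idx), k=2048)
```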

Complexity reduction:

  • Dense: O(L²) → 128K × 128K = 16B ops
  • Sparse (DSA): O(L × k) → 128K × 2K = 256M ops
  • ~60x reduction in attention compute
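A quick back-of-envelope check of these figures (a sketch only: it counts pairwise attention scores and ignores head count, head dimension, and the indexer's own O(L) cost, which is why the text's rounded numbers differ slightly):

```python
# Pairwise attention-score counts at 128K context, dense vs. top-k sparse.
L, k = 128_000, 2_048

dense = L * L      # every token scores every token
sparse = L * k     # every token scores only its k selected tokens

print(f"dense : {dense / 1e9:.1f}B scores")     # 16.4B
print(f"sparse: {sparse / 1e6:.0f}M scores")    # 262M
print(f"speedup: ~{dense / sparse:.0f}x")       # ~62x
```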

2. Scaled RL Post-Training

They allocate more than 10% of pre-training compute to reinforcement learning—dramatically more than typical open models. Key stabilization techniques:

  • Unbiased KL estimates: Standard KL estimator is biased when policies diverge. They use importance sampling correction.
  • Off-policy masking: For negative-advantage sequences with high policy divergence, mask them from gradient updates to prevent instability.
  • Keep Routing: In MoE models, preserve the expert routing paths chosen at inference/rollout time when running the training forward pass (see the sketch after this list). A routing mismatch between the two passes causes instability.
  • Keep Sampling Mask: Maintain identical action spaces between old and new policies.
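A heavily simplified sketch of the Keep Routing idea referenced above: record the top-k expert indices chosen when a rollout was generated and replay them during the training forward pass, so gradients flow through the same experts that produced the sampled tokens. The module name, sizes, and caching interface below are invented for illustration; the paper's MoE layer and router are considerably more elaborate.

```python
import torch
import torch.nn as nn

class CachedRoutingMoE(nn.Module):
    """Toy MoE layer whose routing can be recorded at rollout time and
    replayed during training ("Keep Routing")."""

    def __init__(self, d_model=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x, cached_idx=None):
        logits = self.router(x)                                   # (T, n_experts)
        if cached_idx is None:
            # Rollout / inference: choose experts now and remember the choice.
            cached_idx = logits.topk(self.top_k, dim=-1).indices  # (T, top_k)
        # Training: reuse the rollout-time expert indices; gate weights are
        # recomputed, but the set of active experts matches the rollout.
        gates = torch.softmax(logits.gather(-1, cached_idx), dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = cached_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return out, cached_idx

layer = CachedRoutingMoE()
tokens = torch.randn(16, 256)
_, routing = layer(tokens)                 # rollout: route freely, record indices
y, _ = layer(tokens, cached_idx=routing)   # training: replay identical routing
```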

3. Thinking-in-Tool-Use

Context management that retains reasoning traces across tool calls:

  • Only discard reasoning when a new user message arrives
  • Tool outputs don't trigger context reset
  • Full tool call history is always preserved

This lets the model build on previous reasoning after each tool call instead of starting fresh.
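A sketch of this retention rule over a hypothetical transcript format (dicts with a role and an optional "thinking" field). Real serving stacks encode the rule in the chat template rather than in application code, so treat the function as illustrative only.

```python
def prune_reasoning(messages):
    """Drop reasoning only for turns before the latest user message.
    Reasoning produced after it survives every tool round-trip, and tool
    calls/results are always kept."""
    last_user = max((i for i, m in enumerate(messages) if m["role"] == "user"),
                    default=-1)
    pruned = []
    for i, m in enumerate(messages):
        m = dict(m)
        if i < last_user and m["role"] == "assistant":
            m.pop("thinking", None)   # stale reasoning from earlier turns
        pruned.append(m)
    return pruned

history = [
    {"role": "user", "content": "How many primes are below 100?"},
    {"role": "assistant", "thinking": "Plan: sieve, then count.",
     "content": "<tool_call>sieve(100)</tool_call>"},
    {"role": "tool", "content": "[2, 3, 5, ..., 97]"},
    {"role": "assistant", "thinking": "The list has 25 entries.", "content": "25."},
    {"role": "user", "content": "And below 1000?"},
]
context = prune_reasoning(history)   # earlier reasoning dropped, tool history intact
```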

Architecture

[Diagram: DeepSeek Sparse Attention (DSA). Flow: input hidden state h_t → lightning indexer (few heads, FP8) producing the I_{t,s} scores, alongside a full-precision query projection → top-k selector (k = 2048 tokens) → sparse attention over the selected tokens only → output u_t. Side panels: complexity comparison (dense O(L²) ≈ 16B ops/layer at 128K vs. sparse O(L·k) ≈ 256M ops, ~60x faster) and the training pipeline (pre-training → DSA training → specialist RL for math/code → mixed RL at >10% of pre-training compute → V3).]

Key Equations

GRPO Objective with Off-Policy Masking
$$J_{GRPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\ \text{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\hat{A}_{i,t}\right) \cdot M_{i,t}\right]$$

The mask $M_{i,t}$ zeros out negative-advantage sequences with high policy divergence. Without this, off-policy samples with large likelihood ratios create unstable gradients that can crash training.
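A sketch of the masked, clipped objective in code. The clip range ε and the way "high policy divergence" is detected (a plain threshold on the likelihood ratio) are placeholder assumptions, not the paper's exact rule.

```python
import torch

def masked_grpo_loss(logp_new, logp_old, advantages, eps=0.2, div_threshold=2.0):
    """Clipped GRPO-style policy loss with off-policy masking.

    logp_new, logp_old : (G, T) per-token log-probs under current / rollout policy
    advantages         : (G,)   group-relative advantage per sequence
    """
    ratio = torch.exp(logp_new - logp_old)                 # r_{i,t}(θ)
    adv = advantages[:, None].expand_as(ratio)             # broadcast Â_i over tokens
    per_token = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # M_{i,t}: zero out negative-advantage tokens whose ratio has drifted too far.
    mask = ~((adv < 0) & (ratio > div_threshold))
    per_token = per_token * mask

    # Token mean per sequence, then group mean; negate to minimize.
    return -(per_token.mean(dim=1)).mean()
```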

Unbiased KL Estimate via Importance Sampling
$$D_{KL}(\pi_\theta \| \pi_{ref}) = \frac{\pi_\theta(o_{i,t})}{\pi_{old}(o_{i,t})}\left(\frac{\pi_{ref}(o_{i,t})}{\pi_\theta(o_{i,t})} - \log\frac{\pi_{ref}(o_{i,t})}{\pi_\theta(o_{i,t})} - 1\right)$$

The plain K3 estimator is unbiased only when tokens are sampled from the current policy $\pi_\theta$; once $\pi_\theta$ has drifted from the rollout policy $\pi_{old}$ that actually generated the samples, the estimate becomes biased. The importance weight $\pi_\theta/\pi_{old}$ corrects for this, enabling stable training even with aggressive policy updates.
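The same estimator written out per token, assuming log-probabilities of the sampled tokens under the current, rollout, and reference policies are available; a plain K3 variant is included for comparison. This is a sketch of the formula above, not the training code.

```python
import torch

def k3_kl_plain(logp_new, logp_ref):
    """Standard K3 estimator: unbiased only when samples come from π_θ itself."""
    log_r = logp_ref - logp_new           # log(π_ref / π_θ)
    return torch.exp(log_r) - log_r - 1.0

def k3_kl_unbiased(logp_new, logp_old, logp_ref):
    """Importance-weighted K3 estimator of KL(π_θ || π_ref) for samples drawn
    from the rollout policy π_old (the equation above)."""
    iw = torch.exp(logp_new - logp_old)   # π_θ / π_old re-weights off-policy samples
    return iw * k3_kl_plain(logp_new, logp_ref)
```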

Results

Benchmark        | DeepSeek-V3 | GPT-5-High | Gemini-3.0-Pro
AIME 2025        | 93.1%       | 94.6%      | 95.0%
HMMT Feb 2025    | 92.5%       | 88.3%      | 97.5%
HLE (text-only)  | 25.1%       | 26.3%      | 37.7%
SWE-Verified     | 73.1%       | 74.9%      | 76.2%
BrowseComp       | 67.6%       | 54.9%      | 60.2%

Extended thinking variant (V3-Speciale):

  • AIME: 96.0%
  • HMMT: 99.2%
  • Gold medals in IMO 2025 and IOI 2025

Trade-off: Speciale requires ~2x more tokens than Gemini-3.0-Pro for equivalent performance. The intelligence density per token is lower, but total capability is frontier-level.

What Actually Matters

1. RL compute scaling is the key differentiator. Performance correlates strongly with RL budget. Allocating >10% of pre-training cost to post-training is what separates frontier from near-frontier. Most open models do <1%.

2. Synthetic agentic tasks enable transfer. RL on 1,827 synthesized environments + 85K prompts transfers to unseen benchmarks. Critical finding: RL on only code/search environments does NOT transfer well. Diversity matters.

3. Off-policy masking is crucial. Without masking negative-advantage sequences whose likelihood ratios have drifted far from the rollout policy, training can destabilize in certain scenarios. The mask keeps these unstable gradient updates from corrupting the policy.

4. Keep Routing for MoE. Preserving expert routing paths between inference and training eliminates a major source of instability. If you let routing change during training, experts learn conflicting behaviors.

5. Lightning indexer quality. The indexer must be trained carefully—if it drops important tokens, attention quality degrades. They find FP8 with few heads is the sweet spot: fast enough to preserve speedup, accurate enough to select well.

Assumptions & Limitations

Token efficiency gap. V3-Speciale needs significantly more tokens than Gemini-3.0-Pro for equivalent performance on hard problems. The intelligence density per token is lower—it thinks longer to reach the same conclusions.

Knowledge breadth. Fewer total training FLOPs means narrower world knowledge compared to frontier proprietary models. On obscure trivia and specialized domains, the gap is noticeable.

Context management fragility. Agent frameworks that pass tool results back as user messages (as some code assistants do) break the reasoning-persistence mechanism: because the model discards its reasoning whenever a new user message arrives, these wrappers trigger context resets at exactly the wrong boundaries.

Self-verification loops. The model frequently over-verifies, generating long trajectories that exceed 128K context. This is particularly problematic on MCP benchmarks where it keeps checking its work.

Bottom Line

Open-source models can match GPT-5 on reasoning if you (1) invest heavily in post-training RL (>10% of pre-train compute) and (2) solve the attention efficiency bottleneck. The gap to Gemini-3.0-Pro is token efficiency, not capability ceiling.