DeepSeek V4-Pro and V4-Flash: Migration Guide and API Setup
DeepSeek dropped two new models on April 24, 2026: V4-Pro, a 1.6-trillion-parameter MoE flagship, and V4-Flash, a 284-billion-parameter workhorse optimized for throughput. Both support a one-million-token context window, dual Thinking/Non-Thinking modes, and an OpenAI-compatible API — available immediately on DeepSeek's platform and across third-party providers.
There's a deadline attached. The legacy deepseek-chat and deepseek-reasoner model names are being retired on July 24, 2026, 15:59 UTC. If your application targets either of those strings, you have roughly three months to update a single line of code.
This guide covers what changed architecturally, how V4-Pro and V4-Flash differ, how the benchmarks and pricing compare to frontier alternatives, and exactly how to migrate — with copy-paste code examples.
Why This Matters
DeepSeek's V4 release lands at a moment when the frontier pricing war has broken wide open. GPT-5.5 output tokens cost $30/M; Claude Opus 4.7 costs $25/M. V4-Pro output is $3.48/M — roughly one-seventh the price of GPT-5.5 — while scoring within single-digit percentage points of both models on most coding and reasoning benchmarks.
For cost-sensitive production deployments — high-volume agents, RAG pipelines, code review bots, document analysis systems — the V4 release changes the default calculus. The question is no longer "is open-source good enough?" but "which workloads still justify the closed-source premium?"
The migration urgency adds a second dimension: any team still using deepseek-chat or deepseek-reasoner in production needs to act before July 24.
Model Overview: V4-Pro vs V4-Flash
| Spec | DeepSeek V4-Pro | DeepSeek V4-Flash |
|---|---|---|
| Total Parameters | 1.6T (49B active) | 284B (13B active) |
| Architecture | MoE + Hybrid Attention | MoE + Hybrid Attention |
| Context Window | 1,000,000 tokens | 1,000,000 tokens |
| Reasoning Modes | Thinking + Non-Thinking | Thinking + Non-Thinking |
| Input Pricing (cache miss) | $1.74/M tokens | $0.14/M tokens |
| Input Pricing (cache hit) | $0.145/M tokens | $0.028/M tokens |
| Output Pricing | $3.48/M tokens | $0.28/M tokens |
| License | Apache 2.0 | MIT |
| Weights on Hugging Face | Yes | Yes |
| Best For | Complex reasoning, agents, coding | High-throughput, cost-sensitive tasks |
Both models also receive a 50% discount during Beijing off-peak hours — relevant for batch jobs that don't need real-time response.
V4-Pro is designed for tasks where quality ceiling matters: complex multi-step agents, competitive programming, research-grade reasoning, long-document analysis across the full 1M context.
V4-Flash replaces both deepseek-chat (non-thinking mode) and deepseek-reasoner (thinking mode) in the transition mapping. At $0.28/M output tokens, it's the right default for classification, summarization, extraction, customer-facing chat, and high-volume pipelines where V4-Pro's quality headroom goes unused.
What's New: The Hybrid Attention Architecture
The architectural upgrade that makes V4's 1M context practical is the Hybrid Attention Architecture (HAA) — a combination of two complementary attention strategies applied layer by layer.
Compressed Sparse Attention (CSA)
CSA first compresses KV caches along the sequence dimension (compression rate 4 in V4), then applies DeepSeek Sparse Attention. A "lightning indexer" selects the top-k most relevant compressed KV entries per query: V4-Pro selects the top 1,024; V4-Flash selects the top 512.
This gives the model a high-precision view of the most relevant context chunks — similar to how a search index retrieves only the best-matching documents rather than scanning everything.
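To make the selection step concrete, here is a toy sketch in PyTorch. It is illustrative pseudocode of the CSA idea under simplifying assumptions (mean-pooling stands in for the learned compression module, a plain dot product for the trained lightning indexer); it is not DeepSeek's implementation:

```python
import torch

def csa_select(q, k, v, compress_rate=4, top_k=1024):
    """Toy sketch of the CSA selection idea. Illustrative only --
    real CSA uses a learned compression module and a trained
    "lightning indexer"; here we mean-pool and dot-product instead.
    """
    seq_len, d = k.shape
    n_blocks = seq_len // compress_rate

    # 1. Compress KV along the sequence dimension (rate 4 in V4).
    k_c = k[: n_blocks * compress_rate].reshape(n_blocks, compress_rate, d).mean(dim=1)
    v_c = v[: n_blocks * compress_rate].reshape(n_blocks, compress_rate, d).mean(dim=1)

    # 2. "Lightning indexer": cheap relevance score per compressed entry.
    scores = k_c @ q

    # 3. Keep only the top-k compressed entries (1,024 for Pro, 512 for Flash).
    top = scores.topk(min(top_k, n_blocks)).indices

    # 4. Dense attention over the selected subset only.
    attn = torch.softmax(k_c[top] @ q / d**0.5, dim=0)
    return attn @ v_c[top]

# Example: one query over a 64K-token cache, compressed 4x, top-1024 kept.
q = torch.randn(128)
k, v = torch.randn(65536, 128), torch.randn(65536, 128)
out = csa_select(q, k, v)  # attends to 1,024 of 16,384 compressed entries
```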
Heavily Compressed Attention (HCA)
HCA applies a much more aggressive compression rate of 128, then performs dense attention over that smaller representation. Every layer gets a cheap, global view of distant tokens — the model always knows roughly what happened 800K tokens ago, even if it can't recall exact details.
The Combined Effect
By routing between CSA and HCA at every depth, V4 avoids the standard memory explosion of full attention at 1M tokens. The result:
- 27% of single-token inference FLOPs compared to DeepSeek-V3.2 at equivalent context
- 10% of KV cache memory compared to V3.2
- Usable 1M context on standard inference hardware rather than requiring specialized memory configurations
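Some back-of-envelope arithmetic shows why the savings are so large. Note that this toy comparison is against naive dense attention, not V3.2 (the 27% and 10% figures above are relative to V3.2's already-efficient design), and it assumes attended KV entries dominate cost:

```python
# Back-of-envelope KV accounting at 1M context. Oversimplified: assumes
# the number of attended KV entries is the dominant cost.
SEQ = 1_000_000

full_attn = SEQ                # naive dense attention: every token
csa = min(1024, SEQ // 4)      # 4x compression, top-1024 kept (V4-Pro)
hca = SEQ // 128               # 128x compression, dense over the remainder

print(f"Dense attention : {full_attn:>9,} entries per query")
print(f"CSA (precise)   : {csa:>9,} entries per query")
print(f"HCA (global)    : {hca:>9,} entries per query")
print(f"CSA + HCA       : {csa + hca:>9,} "
      f"(~{100 * (csa + hca) / full_attn:.1f}% of dense)")
```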
DeepSeek also trained V4 on 32T+ tokens using FP4 + FP8 mixed precision (MoE experts at FP4, most parameters at FP8), which contributes to the efficiency advantage over V3.2.
Manifold-Constrained Hyper-Connections (mHC)
A secondary architectural addition: mHC strengthens conventional residual connections to improve signal propagation stability across the model's many layers. The practical effect is more stable training and better performance on tasks requiring deep multi-step reasoning.
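For intuition, here is a generic toy of the hyper-connections family that mHC builds on: several parallel residual streams mixed by learnable weights. This sketch omits DeepSeek's manifold constraint entirely and uses arbitrary shapes; it is not the mHC implementation:

```python
import torch
import torch.nn as nn

class ToyHyperConnection(nn.Module):
    """Generic hyper-connection-style residual: parallel residual streams
    mixed by a learnable matrix. Illustrative only -- this omits the
    manifold constraint that distinguishes DeepSeek's mHC.
    """
    def __init__(self, n_streams: int = 4):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_streams))  # learnable stream mixing

    def forward(self, streams: torch.Tensor, layer_out: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, d_model); layer_out: (batch, d_model)
        mixed = torch.einsum("ij,jbd->ibd", self.mix, streams)
        # Write the sublayer output into the first stream, like a residual add.
        return torch.cat([(mixed[0] + layer_out).unsqueeze(0), mixed[1:]], dim=0)

# A plain residual connection is the special case: one stream, identity mix.
hc = ToyHyperConnection(n_streams=4)
streams = torch.randn(4, 2, 64)      # 4 residual streams
layer_out = torch.randn(2, 64)       # output of an attention/FFN sublayer
streams = hc(streams, layer_out)     # shape preserved: (4, 2, 64)
```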
Benchmarks: How V4-Pro Stacks Up
| Benchmark | V4-Pro | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.6% | ~80.4% | — | — |
| LiveCodeBench | 93.5 | 88.8 | — | 91.7 |
| Codeforces Rating | 3206 | — | 3168 (GPT-5.4) | — |
| BrowseComp (agentic search) | 83.4% | — | — | — |
| MCPAtlas (tool orchestration) | 73.6% | — | — | — |
| HMMT 2026 math | 95.2% | 96.2% | 97.7% (GPT-5.4) | — |
| SimpleQA-Verified (factual recall) | 57.9% | — | — | 75.6% |
| Output cost / M tokens | $3.48 | $25.00 | $30.00 | — |
Where V4-Pro leads: coding. LiveCodeBench 93.5 puts it ahead of both Gemini 3.1 Pro (91.7) and Claude (88.8). On real-world competitive programming via Codeforces rating (3206), it beats GPT-5.4 (3168). SWE-bench Verified (80.6%) lands within 0.2 points of Claude Opus 4.7.
Where it trails: factual knowledge retrieval. SimpleQA-Verified at 57.9% versus Gemini's 75.6% is a meaningful gap for applications that need reliable factual recall (knowledge base Q&A, citation-heavy document work). Advanced math competition problems (HMMT 2026) show Claude (96.2%) and GPT-5.4 (97.7%) pulling ahead.
The agentic benchmarks are the headline: BrowseComp 83.4% and MCPAtlas 73.6% suggest V4-Pro is genuinely competitive at autonomous multi-step tasks — not just raw text generation.
Migration Guide: Updating from deepseek-chat and deepseek-reasoner
The migration is a one-line change in most codebases. DeepSeek kept the base URL and request/response shapes identical. Only the model parameter changes.
Legacy model mapping
| Old model name | New equivalent | Mode |
|---|---|---|
| `deepseek-chat` | `deepseek-v4-flash` | Non-Thinking |
| `deepseek-reasoner` | `deepseek-v4-flash` | Thinking |
During the transition window (until July 24), the legacy names are already silently routing to V4-Flash. After the deadline, they will return errors.
Python migration (OpenAI SDK)
Before:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-chat",  # ← retiring July 24
    messages=[{"role": "user", "content": "Review this code: ..."}]
)
```
After (drop-in replacement):

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # ← updated
    messages=[{"role": "user", "content": "Review this code: ..."}]
)
```
To upgrade to V4-Pro for higher-quality outputs:

```python
response = client.chat.completions.create(
    model="deepseek-v4-pro",  # ← flagship model
    messages=[{"role": "user", "content": "Review this code: ..."}]
)
```
Enabling Thinking mode
Both V4-Pro and V4-Flash support explicit Thinking mode. Enable it per request with the thinking flag (passed via extra_body in the OpenAI SDK), or tune reasoning depth with the reasoning_effort parameter, which supports three levels:
```python
# Thinking mode: for complex reasoning chains
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Solve this step by step: ..."}],
    extra_body={"thinking": True}  # enable extended reasoning
)

# Non-Thinking mode (default): for faster, cheaper completions
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Summarize this document: ..."}]
    # thinking defaults to False
)
```
If you were previously using deepseek-reasoner specifically for its chain-of-thought behavior, migrate to deepseek-v4-flash with "thinking": True — that maps directly to the same reasoning capability at the same price tier.
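In code, that mapping looks like this (same request shape as the Thinking-mode example above):

```python
# deepseek-reasoner equivalent: V4-Flash with Thinking mode enabled
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Solve this step by step: ..."}],
    extra_body={"thinking": True}  # same chain-of-thought behavior
)
```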
TypeScript / Node.js migration
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com",
});

// V4-Flash: fast, cheap, suitable for most tasks
const response = await client.chat.completions.create({
  model: "deepseek-v4-flash",
  messages: [{ role: "user", content: "..." }],
});

// V4-Pro: best quality for coding/agent tasks
const proResponse = await client.chat.completions.create({
  model: "deepseek-v4-pro",
  messages: [{ role: "user", content: "..." }],
});
```
Environment variable approach for easy switching
If you centralize your model name:
```python
import os

# Set in .env or deployment config:
#   DEEPSEEK_MODEL=deepseek-v4-pro
#   DEEPSEEK_MODEL=deepseek-v4-flash
model = os.getenv("DEEPSEEK_MODEL", "deepseek-v4-flash")

response = client.chat.completions.create(
    model=model,
    messages=[...]
)
```
This makes it easy to A/B test Pro vs Flash across deployments without touching application code.
Third-Party Providers
DeepSeek V4 is available on multiple inference providers — useful if you need lower latency, specific geographic regions, or pay-per-use billing alternatives:
| Provider | Models Available | Notes |
|---|---|---|
| DeepSeek API | V4-Pro, V4-Flash | Official, cheapest at peak pricing |
| Together AI | V4-Pro | deepseek-ai/DeepSeek-V4-Pro |
| OpenRouter | V4-Pro, V4-Flash | Unified key across providers |
| DeepInfra | V4-Pro | Low-latency EU and US endpoints |
| APIYI | V4-Flash | 5-minute migration guide available |
| Hugging Face | Both (weights) | Self-host via vLLM or TGI |
For most teams, the DeepSeek API is the default. But if you're already using Together AI or OpenRouter for provider routing, both models are available there without a separate key.
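If you route through OpenRouter, only the key, base URL, and model slug change. The slug below follows OpenRouter's usual vendor/model naming convention but is an assumption; verify the exact string on the model page:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-key",
    base_url="https://openrouter.ai/api/v1",
)

response = client.chat.completions.create(
    # Assumed slug per OpenRouter's vendor/model convention --
    # check the OpenRouter model page for the exact string.
    model="deepseek/deepseek-v4-pro",
    messages=[{"role": "user", "content": "..."}],
)
```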
Self-Hosting Options
Both models have weights published on Hugging Face:
- `deepseek-ai/DeepSeek-V4-Pro` (Apache 2.0)
- `deepseek-ai/DeepSeek-V4-Flash` (MIT)
V4-Flash is the realistic self-host option for most teams. At 284B total parameters with 13B active per forward pass, it runs on multi-GPU setups in FP4/FP8 with reasonably sized hardware. V4-Pro at 1.6T is a data-center-scale deployment — full-scale self-hosting requires significant infrastructure, though quantized versions via Unsloth (unsloth/DeepSeek-V4-Pro) reduce that burden.
vLLM added native support for V4's Hybrid Attention Architecture shortly after release, making it the preferred inference framework for self-hosted deployments. DeepSeek also noted close integration with Huawei's Ascend chips for organizations running on Chinese cloud infrastructure.
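As a starting point, a minimal sketch using vLLM's offline Python API. The tensor-parallel degree and context cap are placeholders to size to your hardware, and the exact engine options for V4's HAA may differ by vLLM version:

```python
from vllm import LLM, SamplingParams

# Sketch: load V4-Flash weights locally. tensor_parallel_size is a
# placeholder -- size it to your GPU count; quantized variants (e.g.
# Unsloth builds) lower the memory floor further.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=8,
    max_model_len=131072,  # trim context to fit memory; 1M needs more HW
)

outputs = llm.generate(
    ["Review this code: ..."],
    SamplingParams(max_tokens=512, temperature=0.6),
)
print(outputs[0].outputs[0].text)
```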
Cost Comparison: When to Use Which Tier
At $0.28/M output tokens, V4-Flash is the right default for:
- High-volume classification and extraction pipelines
- Customer support chatbots
- Real-time summarization
- Any task where `deepseek-chat` was already sufficient
At $3.48/M output tokens, V4-Pro makes sense when:
- You're building coding agents that need reliable SWE-bench-level performance
- Tasks require multi-step agentic reasoning (MCPAtlas-class tool orchestration)
- Documents approach 100K+ tokens and you need deep contextual understanding
- Long-context retrieval across 500K–1M token windows
Even V4-Pro at $3.48/M is a significant discount from frontier alternatives. For a team generating 100M output tokens/month:
- V4-Flash: $28/month
- V4-Pro: $348/month
- Claude Opus 4.7: $2,500/month
- GPT-5.5: $3,000/month
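The arithmetic behind those figures:

```python
tokens_m = 100  # 100M output tokens per month

prices = {  # $ per million output tokens, from the tables above
    "deepseek-v4-flash": 0.28,
    "deepseek-v4-pro": 3.48,
    "claude-opus-4.7": 25.00,
    "gpt-5.5": 30.00,
}

for model, per_m in prices.items():
    print(f"{model:20s} ${per_m * tokens_m:>8,.2f}/month")
# deepseek-v4-flash    $   28.00/month
# deepseek-v4-pro      $  348.00/month
# claude-opus-4.7      $2,500.00/month
# gpt-5.5              $3,000.00/month
```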
The cost-per-quality tradeoff genuinely favors V4-Pro for most mid-complexity workloads that aren't specifically HMMT-level math or heavy factual recall.
Common Mistakes to Avoid
Migrating to the wrong tier: deepseek-reasoner → deepseek-v4-flash (not v4-pro). V4-Flash with thinking mode is the direct functional equivalent of deepseek-reasoner. You don't need to upgrade to V4-Pro just because you were using the reasoning model.
Not setting cache_control breakpoints: V4-Pro cache-hit input is $0.145/M versus $1.74/M for cache-miss — a 12x difference. For agent loops that repeat the same system prompt and tool definitions, prompt caching can cut input costs by 90%. Structure your messages to keep the cacheable prefix stable (system prompt → tools → documents → conversation history → current user message).
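A sketch of that ordering, reusing the client from the migration examples above. The key property is that everything before the newest user turn stays byte-identical across calls, so the shared prefix can cache-hit:

```python
# Keep the cacheable prefix byte-identical across calls. Anything volatile
# (the newest user turn) goes last so the earlier tokens can hit the cache.
STABLE_SYSTEM = "You are a code-review agent. Follow the team style guide."

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    return [
        {"role": "system", "content": STABLE_SYSTEM},   # stable prefix
        *history,                                       # append-only history
        {"role": "user", "content": user_turn},         # volatile tail
    ]

history: list[dict] = []
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=build_messages(history, "Review the latest diff."),
)
```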
Ignoring the 50% off-peak discount: If you're running batch jobs, scheduling them during Beijing off-peak hours halves the cost. For teams in UTC±8 time zones this is trivial to configure; for US/EU teams, a simple cron job targeting the overnight batch window captures the discount without user-facing latency impact.
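To gate batches programmatically, a minimal sketch. The 00:00 to 08:00 Beijing window below is purely a placeholder; check DeepSeek's pricing page for the actual off-peak hours:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Placeholder window: the real off-peak hours are on DeepSeek's pricing page.
OFF_PEAK_START_HOUR, OFF_PEAK_END_HOUR = 0, 8

def in_beijing_off_peak(now: datetime | None = None) -> bool:
    now = now or datetime.now(ZoneInfo("Asia/Shanghai"))
    return OFF_PEAK_START_HOUR <= now.hour < OFF_PEAK_END_HOUR

def run_batch_job() -> None:
    ...  # dispatch the discounted batch (your job runner goes here)

if in_beijing_off_peak():
    run_batch_job()
```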
Assuming Thinking mode is always better: V4-Pro and V4-Flash both support Thinking mode, but enabling it adds latency and cost. Use it for complex multi-step problems where the reasoning chain genuinely helps — not for simple extraction or summarization tasks where it adds overhead without quality benefit.
Testing only on your original benchmark: SimpleQA-Verified at 57.9% is a real gap. If your application depends on factual knowledge retrieval (especially for niche or recent information), test V4-Pro against your specific dataset before committing — Gemini 3.1 Pro may outperform here even at higher cost.
FAQ
Q: Is July 24, 2026 a hard cutoff for deepseek-chat and deepseek-reasoner?
Yes. DeepSeek has stated that deepseek-chat and deepseek-reasoner will be fully retired and inaccessible after July 24, 2026, 15:59 UTC. Any request using those model names after that time will return an error. Plan your migration with time to test — migrating in the final week is risky for production systems.
Q: Do I need to change my base_url when migrating?
No. The base URL (https://api.deepseek.com) remains the same. Only the model parameter in your request body changes. The request and response shapes are unchanged, so existing parsing code requires no modification.
Q: What is the difference between V4-Pro and V4-Flash in thinking mode?
Both models support Thinking and Non-Thinking modes. The difference is capability ceiling: V4-Pro has 49B active parameters versus V4-Flash's 13B, which gives it substantially better performance on complex reasoning and coding tasks even in thinking mode. V4-Flash thinking mode is appropriate for moderately complex problems at lower cost; V4-Pro thinking mode is for tasks where you need the highest quality available.
Q: Can I self-host DeepSeek V4-Flash without special hardware?
V4-Flash at 284B total parameters (13B active per forward pass) is feasible on multi-GPU servers with 80GB+ VRAM total. Quantized variants via Unsloth lower the memory requirements further. V4-Pro self-hosting requires significantly more resources due to the 1.6T total parameter count.
Q: Is DeepSeek V4 suitable for agentic workflows with tool calling?
Yes, this is one of V4-Pro's demonstrated strengths. MCPAtlas score of 73.6% measures tool orchestration performance; BrowseComp at 83.4% covers autonomous search-and-retrieval agents. V4-Pro is competitive with frontier closed-source models on these benchmarks while costing 7-9x less per token.
Q: Does V4 support multimodal inputs?
Currently, both V4-Pro and V4-Flash are text-only. DeepSeek stated they are "working on incorporating multimodal capabilities," but no release date has been announced. For vision tasks, Gemini 3.1 or GPT-5.5 remain the options.
Key Takeaways
- Migrate before July 24, 2026: `deepseek-chat` → `deepseek-v4-flash`, and `deepseek-reasoner` → `deepseek-v4-flash` with thinking mode enabled. It's a one-line code change.
- V4-Flash is the default upgrade: cheaper than any frontier alternative at $0.28/M output, with 1M context and dual thinking modes built in.
- V4-Pro for coding and agents: SWE-bench Verified 80.6%, LiveCodeBench 93.5, Codeforces 3206 — leads the field on coding benchmarks at $3.48/M output (1/7th of GPT-5.5).
- Hybrid Attention Architecture makes 1M context practical: 27% of the inference FLOPs and 10% of the KV cache memory of V3.2, enabling long-context retrieval at an inference cost that previously required far more hardware.
- Self-hosting is viable for Flash: Apache 2.0 (V4-Pro) and MIT (V4-Flash) licenses, weights on Hugging Face, vLLM support. Flash's 13B active parameters make it runnable on a multi-GPU server.
- Watch for factual recall gaps: SimpleQA at 57.9% means V4-Pro isn't the right choice for factual knowledge-heavy applications. Test on your specific dataset before committing.
DeepSeek V4-Pro is the best cost-per-quality option for coding and agentic workloads in April 2026, period. V4-Flash replaces deepseek-chat at the same price tier with a much larger context window. Migrate both before July 24 — it's one line of code and there's no reason to wait.
Prefer a deep-dive walkthrough? Watch the full video on YouTube.