DeepSeek V3.2: Thinking and Tool Use in One API Call
When OpenAI released o1, one of its most celebrated features was the ability to think before answering. But there was a catch: when you needed the model to call a tool — run code, fetch a web page, query a database — thinking had to stop. You got either reasoning or action, not both at once.
DeepSeek V3.2 is the first model that removes that constraint entirely. It can reason through a problem while calling tools, and then continue reasoning with the results. That distinction matters far more than any benchmark number.
Why This Matters for Production AI
Most real agentic workflows look like this: gather information → reason about it → decide what to do next → gather more information → conclude. The loop between "thinking" and "acting" is the entire job.
Prior to V3.2, developers building with reasoning models had two bad options. You could use a reasoning-capable model (like deepseek-reasoner or Claude's extended thinking) but sacrifice tool use, forcing you to handle tool orchestration entirely in application code. Or you could use a tool-capable model but lose the structured thinking that makes complex multi-step problems tractable.
DeepSeek V3.2 collapses these two into a single API call. It is also MIT licensed, which means you can download weights, self-host, fine-tune, and deploy without restriction. At $0.14 per million input tokens on the managed API, it is roughly 10× cheaper than GPT-5 Standard and 100× cheaper than Claude Opus 4.
Architecture: 685B Parameters, 37B Active
DeepSeek V3.2 uses a Mixture of Experts (MoE) architecture: 685 billion total parameters, but only 37 billion activate for any given token. This is the same approach used by Mistral's Mixtral (and, reportedly, GPT-4): a learned routing mechanism selects which small subset of "expert" sub-networks handles each token.
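To make the routing idea concrete, here is a toy top-2 router in plain Python. It is an illustrative sketch of the general mechanism only, not DeepSeek's actual implementation — the real router is a learned linear layer scoring hundreds of experts, and the selected experts' outputs are then combined with these weights:

```python
import math

def top2_route(logits):
    """Toy MoE router: softmax over per-expert scores, keep the two
    highest-scoring experts, renormalize their weights to sum to 1."""
    probs = [math.exp(x) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top2 = sorted(range(len(logits)), key=lambda i: -probs[i])[:2]
    norm = sum(probs[i] for i in top2)
    return {i: probs[i] / norm for i in top2}

# One token's router scores over 4 experts: only experts 2 and 0 fire,
# so only those experts' parameters are touched for this token.
print(top2_route([1.0, 0.1, 2.0, -1.0]))
```

This is why a 685B-parameter model can run with 37B active parameters per token: the other experts are never evaluated.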
What V3.2 adds over V3.1 and V3:
DeepSeek Sparse Attention (DSA) builds on top of the Multi-head Latent Attention (MLA) used in earlier versions rather than replacing it: DSA restricts each query to a selected subset of tokens, reducing the computational cost of attention while preserving long-context performance, and MLA's key/value compression keeps the KV cache small. Together they support a 128,000-token context window without prohibitive memory costs. You can fit an entire mid-sized codebase, or a full day's chat history, into a single request.
Expert Specialization via RL — A reinforcement learning training pass targeted specific expert subsets for reasoning tasks and for programming (supporting 338 languages). When V3.2 detects complex multi-step reasoning, it routes tokens through the reasoning-optimized experts. When it sees code, the programming experts activate.
Agentic Task Synthesis Pipeline — The post-training data included over 1,800 simulated environments and 85,000+ complex instructions, specifically to train the model to complete multi-step agentic tasks reliably.
The net result: V3.2 scored 96.0% on AIME 2025 (compared to 94.6% for GPT-5-High), achieved gold-medal performance at the International Mathematical Olympiad and International Olympiad in Informatics, and performs comparably to GPT-5 on most coding benchmarks — at a fraction of the price.
The Killer Feature: Thinking While Using Tools
In V3.2, the model can:
- Receive a user message
- Enter thinking mode and reason through the problem
- Emit a tool call (mid-reasoning)
- Receive the tool result
- Continue reasoning with that result
- Emit another tool call if needed
- Produce a final answer
All within a single streamed API response.
Here is what the Python code looks like using the DeepSeek-compatible OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current price of a stock ticker",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker symbol, e.g. AAPL"
                    }
                },
                "required": ["ticker"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {
            "role": "user",
            "content": "Should I buy NVDA today? Analyze the current price and give me a reasoned recommendation."
        }
    ],
    tools=tools,
    extra_body={"thinking": {"type": "enabled"}}
)

# Access the chain-of-thought reasoning
for choice in response.choices:
    if hasattr(choice.message, "reasoning_content"):
        print("Thinking:", choice.message.reasoning_content)
    print("Answer:", choice.message.content)
```
Note the extra_body={"thinking": {"type": "enabled"}} parameter — this activates thinking mode. Without it, tool calls still work; you just don't get the reasoning chain. With it, you get both.
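When the model does emit a tool call, your code executes the function and appends the result as a `role: "tool"` message before calling the API again. Here is a minimal dispatcher sketch with `get_stock_price` stubbed locally — a real implementation would hit a market-data API, and the fixed price below is purely illustrative:

```python
import json

def get_stock_price(ticker):
    # Stub for illustration; a real version would query a market-data API.
    return {"ticker": ticker, "price": 123.45}

TOOLS = {"get_stock_price": get_stock_price}

def run_tool_call(tool_call):
    """Execute one tool_call from the API response and build the
    role="tool" message to append to the conversation."""
    fn = TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    result = fn(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Shape mirrors the OpenAI-style tool_calls entry (shown here as a dict;
# the SDK returns objects with the same fields).
call = {"id": "call_1",
        "function": {"name": "get_stock_price",
                     "arguments": '{"ticker": "NVDA"}'}}
print(run_tool_call(call))
```

You append this message to the conversation, call `create` again, and the model continues reasoning with the tool result.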
For streaming responses (recommended for production to avoid timeout on long reasoning chains):
```python
stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Analyze this problem step by step..."}],
    tools=tools,
    extra_body={"thinking": {"type": "enabled"}},
    stream=True
)

reasoning_buffer = []
answer_buffer = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning_buffer.append(delta.reasoning_content)
    if delta.content:
        answer_buffer.append(delta.content)

print("Reasoning:", "".join(reasoning_buffer))
print("Answer:", "".join(answer_buffer))
```
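Streaming adds one wrinkle for tool calls: the function arguments arrive as string fragments spread across chunks and must be concatenated before you can parse them. A simplified accumulator sketch, using dict-shaped deltas for brevity (the SDK delivers objects with the same fields nested under `function`):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool_call deltas into complete calls.
    Each delta carries an index plus partial name/arguments fields."""
    calls = {}
    for delta in deltas:
        slot = calls.setdefault(delta["index"], {"name": "", "arguments": ""})
        if delta.get("name"):
            slot["name"] = delta["name"]
        slot["arguments"] += delta.get("arguments", "")
    return calls

# Two chunks that together form one complete tool call.
chunks = [
    {"index": 0, "name": "get_stock_price", "arguments": '{"tick'},
    {"index": 0, "arguments": 'er": "NVDA"}'},
]
calls = accumulate_tool_calls(chunks)
print(json.loads(calls[0]["arguments"]))
```

Only parse the arguments once the stream signals the call is complete (the finish reason flips to `tool_calls`); parsing a partial fragment raises a `JSONDecodeError`.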
One critical implementation detail: when you pass prior turns back into a multi-turn conversation, include only content from assistant messages, not reasoning_content. The reasoning chain is output-only; including it in the next request causes a 400 error.
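A small helper makes this hard to get wrong. This is an illustrative sketch (the helper name is ours); the `reasoning_content` field name matches the DeepSeek response format:

```python
def to_history(messages):
    """Strip the output-only reasoning_content field from prior turns
    before sending them back to the API; keeping it triggers a 400."""
    return [
        {k: v for k, v in msg.items() if k != "reasoning_content"}
        for msg in messages
    ]

turns = [
    {"role": "user", "content": "Should I buy NVDA today?"},
    {"role": "assistant", "content": "Here is my analysis...",
     "reasoning_content": "Let me check the price first..."},
]
print(to_history(turns))
```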
Getting Started: First API Call in 5 Minutes
DeepSeek's API is fully compatible with the OpenAI SDK. If you already have OpenAI-based code, switching is two lines:
```shell
pip install openai
```

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",  # from platform.deepseek.com
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-chat",  # V3.2 model
    messages=[
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Explain why MoE models activate fewer parameters per token."}
    ]
)

print(response.choices[0].message.content)
```
Model names on the DeepSeek API:
- `deepseek-chat` → DeepSeek V3.2 (general-purpose, tool use, thinking)
- `deepseek-reasoner` → DeepSeek R2 (pure reasoning, no tool use)
For accessing V3.2 specifically through LiteLLM as a proxy (which handles caching, fallback, and cost tracking across 140+ providers):
```python
import litellm

response = litellm.completion(
    model="deepseek/deepseek-chat",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"thinking": {"type": "enabled"}}
)
```
Self-Hosting DeepSeek V3.2
Model weights are available on Hugging Face at deepseek-ai/DeepSeek-V3.2. The model is MIT licensed — you can download, modify, and deploy without usage restrictions or revenue sharing.
The catch: 685B parameters in BF16 requires approximately 1.4 TB of GPU VRAM; the natively released FP8 checkpoint needs roughly half that. A realistic production deployment needs:
- 8× H100 80GB (640 GB total — enough for the FP8 weights, not for BF16)
- 8× H200 141GB (1.1 TB total — comfortable, with headroom for batching)
For organizations without that hardware budget, quantized variants are available:
- `QuantTrio/DeepSeek-V3.2-AWQ` — AWQ 4-bit quantization, reduces memory to roughly 350 GB VRAM
The recommended serving framework is vLLM, which has a dedicated recipe for V3.2:
```shell
pip install vllm

vllm serve deepseek-ai/DeepSeek-V3.2 \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --trust-remote-code
```
Once running, vLLM exposes an OpenAI-compatible endpoint at http://localhost:8000/v1, so all client code stays unchanged — just swap the base_url.
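Pointing the earlier client at the local endpoint is a two-line change. A minimal configuration sketch — the key value is arbitrary unless you start vLLM with `--api-key`:

```python
from openai import OpenAI

# Same client code as the managed API; only base_url (and a dummy key) change.
client = OpenAI(
    api_key="not-needed-locally",  # vLLM ignores this unless --api-key is set
    base_url="http://localhost:8000/v1",
)
```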
For monitoring your self-hosted deployment, Langfuse integrates directly with the OpenAI-compatible endpoint to trace requests, track latency, and monitor token costs without any DeepSeek-specific configuration.
Vertex AI: Managed Deployment Without the Infrastructure
If you want managed inference without running your own GPU cluster, Google Cloud's Vertex AI now includes DeepSeek-V3.2 in its Model Garden:
```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="YOUR_PROJECT_ID", location="us-central1")

model = GenerativeModel("publishers/deepseek-ai/models/deepseek-v3-2")
response = model.generate_content(
    "Explain the architecture of MoE transformers."
)
print(response.text)
```
Vertex AI handles scaling, availability, and data residency. If your organization requires data to stay within Google Cloud infrastructure, this is the path to V3.2 without the self-hosting overhead. A deployment notebook is available in the GoogleCloudPlatform/vertex-ai-samples GitHub repository for teams that want to deploy the open weights to their own Vertex AI endpoint.
Pricing and Cost Analysis
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache Discount |
|---|---|---|---|
| DeepSeek V3.2 | $0.14 | $0.28 | 90% off (cached) |
| GPT-5 Standard | $1.25 | $10.00 | 50% off |
| Claude Opus 4 | $15.00 | $75.00 | ~90% (prompt cache) |
| Gemini 3.1 Pro | $1.25 | $5.00 | 75% off |
The 90% cache discount is particularly powerful for agentic workloads where the same system prompt and context prefix repeats across thousands of requests — a pattern common in RAG pipelines. If you prefix every request with a 50,000-token knowledge base, V3.2's cached input rate makes that effectively free at scale.
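The arithmetic is worth making explicit. Using the table's rates ($0.14 per 1M input tokens, 90% off cache hits), the per-request cost of that 50,000-token prefix works out as follows:

```python
# Back-of-envelope cost of a repeated 50,000-token prefix at V3.2 rates.
PRICE_PER_M = 0.14      # $ per 1M input tokens (uncached)
CACHE_DISCOUNT = 0.90   # 90% off for cache hits
prefix_tokens = 50_000

uncached = prefix_tokens / 1e6 * PRICE_PER_M        # first request
cached = uncached * (1 - CACHE_DISCOUNT)            # every repeat
print(f"uncached: ${uncached:.4f}/request, cached: ${cached:.5f}/request")
```

At $0.0007 per cached request, ten thousand requests against the same prefix cost about $7 in input tokens.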
When pairing V3.2 with structured output libraries, the combination of thinking mode and Pydantic validation provides both reasoning transparency and output reliability — without the premium cost of frontier closed models.
Common Mistakes to Avoid
Passing reasoning_content back into conversation history — Do not include reasoning_content from prior assistant turns when building the messages array for the next request. The API only accepts content in historical messages. Including reasoning_content returns a 400 error.
Using deepseek-reasoner when you need tools — deepseek-reasoner (R2) does not support tool calls. For thinking-plus-tools workflows, you need deepseek-chat (V3.2) with the thinking parameter enabled.
Expecting thinking mode for free — Thinking mode consumes additional output tokens for the reasoning chain. The chain is not billed at a separate rate, but longer reasoning chains increase total token usage. For simple queries, consider leaving thinking disabled.
Ignoring tensor parallelism for self-hosting — Running V3.2 on a single multi-GPU node requires --tensor-parallel-size to match the number of GPUs. Setting this incorrectly causes out-of-memory errors at load time. The vLLM recipe documentation specifies recommended configurations per GPU count.
Not setting max-model-len for long context — The default vLLM maximum sequence length may truncate requests below 128K tokens. Set --max-model-len 131072 explicitly if you need the full context window.
FAQ
Q: Can I use DeepSeek V3.2 with the OpenAI Python SDK without any changes?
Yes, with two modifications: set base_url="https://api.deepseek.com" and use your DeepSeek API key. Model names differ — use deepseek-chat instead of gpt-4o. All other SDK features (streaming, tool calls, system messages, logprobs) work unchanged.
Q: What is the difference between DeepSeek V3.2 and DeepSeek R2?
V3.2 (deepseek-chat) is the general-purpose model with tool use, function calling, and optional thinking mode. R2 (deepseek-reasoner) is a pure reasoning model optimized for mathematical and logical problems; it does not support tool calls. Use V3.2 when you need both reasoning and action; use R2 when you only need deep reasoning with no external tool invocations.
Q: Is DeepSeek V3.2 safe to use in production for enterprise workloads?
For EU and US enterprises with data sovereignty requirements, the managed DeepSeek API routes traffic through servers outside those jurisdictions. Vertex AI's managed endpoint keeps data within Google Cloud regions and provides enterprise SLAs. Self-hosting on your own infrastructure provides full control. Assess your compliance requirements before choosing a deployment path.
Q: How does V3.2 compare to fine-tuned domain-specific models?
For most tasks, V3.2's instruction-tuned base is strong enough to outperform smaller domain-specific fine-tunes. If you need V3.2's performance on a narrow domain with specific output formats, the MIT license permits fine-tuning via LoRA or QLoRA. See the fine-tuning guide for practical setup instructions.
Q: Does V3.2 support function calling in JSON Schema format?
Yes. The tool definition format is identical to OpenAI's — a tools array with type: "function" entries, each containing a JSON Schema parameters block. Responses include tool_calls objects with function.name and function.arguments.
Key Takeaways
- DeepSeek V3.2 is the first model to support chain-of-thought reasoning while making tool calls — removing the reasoning-or-action tradeoff that plagued earlier agentic architectures
- 685B MoE parameters with 37B active per token, 128K context, DeepSeek Sparse Attention for long-context efficiency
- MIT license with weights on Hugging Face — fully self-hostable on 8× H100 or via quantized variants on smaller setups
- $0.14/M input tokens on the managed API with 90% cache discounts, making it the most cost-efficient frontier-class model available
- Drop-in OpenAI SDK compatible: change `base_url` and `api_key`, keep the rest of your code
- Available on Vertex AI Model Garden for teams needing managed inference within Google Cloud
DeepSeek V3.2 is the most practical open-source model for production agentic systems in 2026. The combination of thinking-with-tools, MIT licensing, competitive benchmark performance, and sub-$0.30 pricing makes it the obvious starting point for any new LLM infrastructure project that isn't locked into a specific cloud provider's ecosystem.