DeepSeek dropped V4 on April 24, 2026. 1.6T MoE model, 1M token default context, Huawei Ascend-native, and API pricing that makes GPT-4 Turbo look like a luxury tax. I went through the technical report and ran some cost scenarios. Sharing what I found.
V4-Flash output is $0.28/M. GPT-4 Turbo output is $30/M. That's a 107x difference.
A production chatbot at 1M queries/month (avg 500 output tokens each, plus input; the totals below match roughly 1,000 input tokens per query at input rates of $0.014/M and $10/M):
V4-Flash: ~$154/month
GPT-4 Turbo: ~$25,000/month
That's not just "cheaper"; it's a different product category: a side-project budget versus an enterprise budget.
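For anyone who wants to check the chatbot numbers, here is a minimal sketch of the arithmetic. The output size (500 tokens) and the per-million rates are the ones quoted in this post. The input size of 1,000 tokens per query is an assumption: the post does not state it, and it is used here only because it reproduces both monthly totals. The model keys are hypothetical labels, not API identifiers.

```python
# USD per 1M tokens, as quoted in this post.
RATES = {
    "v4-flash": {"input": 0.014, "output": 0.28},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
}

def monthly_cost(model, queries, input_tokens, output_tokens):
    """Return (input_cost, output_cost, total) in USD for one month."""
    rate = RATES[model]
    input_cost = queries * input_tokens / 1_000_000 * rate["input"]
    output_cost = queries * output_tokens / 1_000_000 * rate["output"]
    return input_cost, output_cost, input_cost + output_cost

# ASSUMPTION: 1,000 input tokens per query (not stated in the post).
for model in RATES:
    inp, out, total = monthly_cost(model, 1_000_000, 1_000, 500)
    print(f"{model}: input ${inp:,.2f} + output ${out:,.2f} = ${total:,.2f}/month")
```

Changing the assumed input size shifts the GPT-4 Turbo total far more than the V4-Flash total, because the input rates differ by a much larger factor than the output rates.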
This isn't just aggressive pricing. The cost reduction is structural.
1. Engram (memory-compute separation)
Static knowledge — facts, world knowledge, trained associations — lives in CPU RAM. Dynamic reasoning runs on GPU. CPU RAM costs 10-20x less per GB than GPU HBM.
The practical effect: 1M context is the default on all tiers, including Flash, because extending context doesn't proportionally increase GPU memory usage. The Engram layer handles retrieval from CPU-side memory; the GPU handles attention over the retrieved context. You're not paying GPU HBM costs for the full 1M token window.
2. DSA (DeepSeek Sparse Attention)
Standard attention is O(n²) in sequence length. DSA compresses at the token dimension, not just the head dimension — reducing attention to near-linear scaling. The paper reports 60-70% memory bandwidth reduction per attention layer.
This is what makes 1M context inference feasible at Flash-tier prices. Without DSA, 1M context at $0.28/M output would require either massive hardware subsidies or a loss-leader strategy. With DSA, the compute cost actually supports the price.
3. mHC (Manifold-Constrained Hyper-Connections)
Training a 1.6T MoE model is expensive. Failed training runs are catastrophically expensive. mHC projects layer connections onto a bi-stochastic matrix manifold using Sinkhorn-Knopp normalization, enforcing signal conservation through the network. This prevents training collapse — the kind of instability that kills large MoE runs mid-training.
Fewer failed training runs = lower amortized training cost = lower API prices. The pricing isn't just a go-to-market decision; it's partly a consequence of training efficiency.
V4 is the first Tier-1 model fully co-optimized for Huawei Ascend chips (910B and 950). Not ported from CUDA — built natively with CANN (Compute Architecture for Neural Networks).
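Previous ceiling on Ascend for production inference: ~60% utilization. V4 on Ascend: 85%+. That 25-point gap is about 1.4x the useful throughput per chip (85/60), or roughly 29% fewer chips for the same load. The headline figure is roughly 40% lower hardware cost at inference scale.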
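To see what the utilization numbers mean in chip counts, here is a rough sketch. The workload size and per-chip peak are made-up units (only the ratio between the two cases matters), and the function name is hypothetical.

```python
import math

def chips_needed(required_throughput, peak_per_chip, utilization):
    """Chips required to sustain a throughput at a given utilization."""
    return math.ceil(required_throughput / (peak_per_chip * utilization))

# ASSUMPTION: hypothetical workload of 1,000 throughput units and a peak of
# 1 unit per chip; both are placeholders chosen to make the ratio visible.
before = chips_needed(1_000, 1.0, 0.60)
after = chips_needed(1_000, 1.0, 0.85)

print(f"chips at 60% utilization: {before}")
print(f"chips at 85% utilization: {after}")
print(f"throughput per chip: {0.85 / 0.60:.2f}x")
print(f"chips saved: {1 - after / before:.0%}")
```

This only covers the utilization effect. Chip purchase price and cloud instance pricing are separate inputs to any total hardware cost figure.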
The context: U.S. export restrictions have cut off Chinese AI labs from Nvidia H100/H200 supply. DeepSeek's response wasn't to find workarounds — it was to co-engineer with Huawei at the kernel level. MoE routing, sparse attention, and Engram's memory retrieval operations are all tuned specifically for Ascend silicon.
One day before V4's release, DeepSeek refused early API access to U.S. chip manufacturers including Nvidia. Two parallel AI infrastructure stacks are now emerging. As a builder, you're going to have to decide which one you're building on, or how to stay portable between them.
The API is OpenAI-compatible, so migration is minimal:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your_deepseek_api_key",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # or deepseek-v4-pro
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Analyze this document..."},
    ],
    max_tokens=2048,
)

print(response.choices[0].message.content)
```

Change base_url and model, keep everything else. If you're already on OpenAI's SDK, the switch is two lines.
The 1M context window changes what's architecturally feasible. RAG pipelines exist partly because context windows were too small and too expensive to stuff full documents into. With 1M context and Flash-tier pricing ($0.28/M output, with input tokens priced lower still), you can reconsider whether chunking + retrieval is actually the right architecture for your use case, or whether "just send the whole thing" is now viable.
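A quick way to frame the "send the whole thing" decision is a pre-flight check on document size. The sketch below is only that: the 4-characters-per-token ratio is a rough heuristic for English text (an assumption, not a tokenizer), the input rate is the V4-Flash figure of $0.014/M used in the cost scenarios, and the function names are hypothetical.

```python
CONTEXT_LIMIT = 1_000_000   # tokens, per the post
INPUT_RATE_PER_M = 0.014    # USD per 1M input tokens (V4-Flash figure used in this post)
CHARS_PER_TOKEN = 4         # ASSUMPTION: rough heuristic for English text

def estimate_tokens(text):
    """Very rough token estimate; use a real tokenizer for billing-grade numbers."""
    return len(text) // CHARS_PER_TOKEN + 1

def plan_request(document, reserved_for_output=10_000):
    """Decide whether the whole document fits in a single request."""
    tokens = estimate_tokens(document)
    fits = tokens + reserved_for_output <= CONTEXT_LIMIT
    return {
        "estimated_input_tokens": tokens,
        "strategy": "send_whole_document" if fits else "chunk_and_retrieve",
        "estimated_input_cost_usd": tokens / 1_000_000 * INPUT_RATE_PER_M,
    }

print(plan_request("lorem ipsum " * 50_000))
```

Fitting in the window is necessary but not sufficient: latency and answer quality at very long contexts still need to be measured on your own documents.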
The two-stack question is real. If you're building something that needs to run in China or on Ascend infrastructure, V4 is now a first-class option, not a compromise. If you're building for global deployment, you need to think about whether your architecture is portable or locked to one stack.
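One low-effort way to stay portable is to keep the provider choice in configuration rather than in code. Here is a minimal sketch assuming both providers are reached through the OpenAI SDK. The provider names, environment variable names, and the `client_settings` helper are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Provider:
    base_url: Optional[str]  # None means "use the SDK's default endpoint"
    model: str
    api_key_env: str

# ASSUMPTION: names and env vars below are illustrative placeholders.
PROVIDERS = {
    "deepseek": Provider("https://api.deepseek.com", "deepseek-v4-flash", "DEEPSEEK_API_KEY"),
    "openai": Provider(None, "gpt-4-turbo", "OPENAI_API_KEY"),
}

def client_settings(name):
    """Return (client keyword arguments, model name) for a provider."""
    provider = PROVIDERS[name]
    kwargs = {"api_key": os.environ.get(provider.api_key_env, "")}
    if provider.base_url:
        kwargs["base_url"] = provider.base_url
    return kwargs, provider.model

# Usage with the OpenAI SDK would look like:
#   kwargs, model = client_settings(os.environ.get("LLM_PROVIDER", "deepseek"))
#   client = OpenAI(**kwargs)
kwargs, model = client_settings("deepseek")
print(model, kwargs.get("base_url"))
```

This only covers the request plumbing. Prompt behavior, tool-calling details, and rate limits can still differ between providers and need their own tests.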
The post-scaling paradigm shift: V4's innovations (Engram, DSA, mHC) are all about efficiency at scale, not raw parameter count. The next generation of competitive models probably won't be "bigger" — they'll be "more efficient at the same size." That changes how you evaluate model releases going forward.
API: api.deepseek.com
Technical report: DeepSeek V4 PDF
Weights: HuggingFace — DeepSeek-V4-Pro
Try DeepSeek Chat: EvoLink
Would be curious what others are finding:
Anyone tested the 1M context on real document workloads? What's the latency like at that scale?
Has anyone benchmarked V4-Flash vs. V4-Pro on coding tasks specifically?
If you're building something that needs to run on Ascend infrastructure, what does your stack look like?
DeepSeek-V4: Running on Huawei Chips, Priced 100x Below GPT-4 — A Complete Technical Breakdown
DeepSeek-V4 is a 1.6 trillion parameter Mixture-of-Experts large language model released on April 24, 2026 by DeepSeek, a Chinese AI research lab. V4 combines 1 million token default context, Agent-level capabilities rivaling Claude Opus 4.6, and API pricing that undercuts every major Western AI provider by one to two orders of magnitude. V4-Flash costs $0.28 per million output tokens while GPT-4 Turbo charges $30 for the same volume. Beyond the pricing shock, V4 is the first Tier-1 foundation model fully optimized for Huawei Ascend chips, achieving 85%+ hardware utilization on domestic Chinese silicon — a milestone that reshapes the global AI infrastructure landscape.
In October 2022, the US Bureau of Industry and Security imposed sweeping export controls on advanced semiconductors and GPU hardware to China. The restrictions targeted Nvidia A100, H100, and subsequent high-end accelerators, cutting off Chinese AI labs from the hardware that powered virtually every frontier model at the time. Chinese labs faced three paths forward: stockpile pre-ban Nvidia chips (a finite and depreciating strategy), run workloads on older-generation GPUs with lower throughput, or invest in domestic chip alternatives and optimize software stacks to close the performance gap.
DeepSeek chose the third path, and V4 is the proof that it works at production scale.
Most domestic Chinese accelerators struggle to exceed 60% utilization on real inference workloads. The gap between theoretical peak FLOPS and actual sustained throughput has been the core bottleneck — not raw chip capability, but software maturity, memory bandwidth management, and operator-level optimization.
DeepSeek achieved 85%+ sustained utilization on Huawei Ascend 910B and the newer Ascend 950 through three engineering efforts:
1. Kernel-level co-optimization with Huawei's CANN framework. DeepSeek engineers worked directly with Huawei's computing architecture team to rewrite critical kernels — specifically MoE (Mixture-of-Experts) routing operations and sparse attention computations — for the CANN (Compute Architecture for Neural Networks) software stack. Standard deep learning frameworks treat the GPU as a generic compute target; DeepSeek's approach treats Ascend's specific memory hierarchy, interconnect topology, and instruction scheduling as first-class optimization targets.
2. Operator fusion redesigned for Ascend memory bandwidth. Transformer inference is memory-bandwidth-bound, not compute-bound, for most practical workloads. DeepSeek redesigned operator fusion patterns specifically for Ascend's HBM bandwidth profile, reducing the number of round-trips between on-chip SRAM and HBM. This is particularly impactful for MoE models where expert routing creates irregular memory access patterns that naive implementations handle poorly.
3. Production validation at scale. DeepSeek ran extensive A/B comparisons between Ascend-based and Nvidia A100-based inference clusters. The result: equivalent output quality with approximately 40% lower total hardware cost on Ascend deployments. This cost advantage comes from both lower chip acquisition costs and Huawei's aggressive pricing for cloud-based Ascend instances.
V4's Ascend optimization is not a research demo or a benchmark stunt. It is a production deployment serving real API traffic. This matters because it demonstrates that the Chinese domestic chip ecosystem has crossed the viability threshold for frontier AI inference — not matching Nvidia's latest hardware generation, but delivering competitive price-performance on the workloads that actually matter for commercial deployment.
V4 introduces three novel architectural components that collectively enable its performance profile. Each addresses a specific bottleneck in scaling large MoE models.
Traditional Transformer architectures store all learned knowledge within GPU-resident weight matrices. Every inference pass loads these weights from GPU HBM, performs matrix multiplications, and writes results back. As models scale to trillions of parameters, this creates a fundamental tension: you need enormous GPU memory to hold the weights, but most of that memory sits idle during any given forward pass (especially in MoE models where only a fraction of experts activate per token).
V4's Engram architecture splits the model into two distinct subsystems:
- Static knowledge retrieval module — resides in CPU RAM. Uses hash-based lookup tables to retrieve factual knowledge, entity relationships, and learned associations. This module does not require GPU compute; it functions more like a high-speed database query than a neural network forward pass.
- Dynamic reasoning module — resides on GPU. Handles the actual reasoning, planning, chain-of-thought, and generation tasks. This module is smaller, faster, and focused purely on computation rather than knowledge storage.
This separation is why V4 can offer 1 million tokens as the default context window without astronomical GPU memory costs. The static knowledge layer scales with CPU RAM (cheap, abundant, easily expandable), while the dynamic reasoning layer scales with GPU compute (expensive, but now handling a much smaller working set). The architecture effectively decouples the "how much the model knows" dimension from the "how hard the model can think" dimension.
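To make the split concrete, here is a toy sketch of the idea as described above: a hash-keyed lookup table held in ordinary host memory, whose results are handed to a separate compute step. This is not DeepSeek's implementation. The table size, vector width, n-gram hashing scheme, and the `fuse` step are all assumptions chosen for illustration.

```python
import random

TABLE_SLOTS = 1 << 12   # ASSUMPTION: illustrative table size
EMBED_DIM = 8           # ASSUMPTION: illustrative vector width

_rng = random.Random(0)
# Stand-in for the static knowledge table; plain Python lists live in host (CPU) memory.
MEMORY_TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(TABLE_SLOTS)]

def ngram_slot(token_ids):
    """Deterministic polynomial hash of a token n-gram into a table slot."""
    h = 0
    for t in token_ids:
        h = (h * 1_000_003 + t) % TABLE_SLOTS
    return h

def retrieve(token_ids, n=2):
    """Return one memory vector per position, keyed by the trailing n-gram."""
    vectors = []
    for i in range(len(token_ids)):
        gram = token_ids[max(0, i - n + 1): i + 1]
        vectors.append(MEMORY_TABLE[ngram_slot(gram)])
    return vectors

def fuse(hidden, memory):
    """Stand-in for the accelerator-side step: add retrieved memory to hidden states."""
    return [[h + m for h, m in zip(h_row, m_row)] for h_row, m_row in zip(hidden, memory)]

tokens = [17, 42, 42, 7]
memory = retrieve(tokens)
hidden = [[0.0] * EMBED_DIM for _ in tokens]
fused = fuse(hidden, memory)
print(len(fused), len(fused[0]))
```

The point of the sketch is the cost profile: `retrieve` is a table lookup with no matrix multiplication, so the table can grow with cheap memory while the compute step stays the same size.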
Training a 1.6 trillion parameter MoE model is notoriously unstable. At this scale, gradient signals must propagate through hundreds of layers and across thousands of expert modules. Small numerical instabilities compound exponentially, leading to training collapse events — sudden divergences where loss spikes and the model's learned representations degrade irreversibly. Previous approaches relied on careful learning rate scheduling, gradient clipping, and extensive hyperparameter tuning to manage stability. These are fragile solutions that often require human intervention during training runs.
V4's mHC (Manifold-Constrained Hyper-Connections) takes a fundamentally different approach. Instead of treating stability as a hyperparameter tuning problem, mHC enforces it as a mathematical constraint:
- Bi-stochastic matrix manifold. Layer-to-layer connections are projected onto the manifold of bi-stochastic (doubly stochastic) matrices: nonnegative matrices whose rows and columns each sum to exactly 1. Mixing signals through such a matrix preserves their total and cannot increase their norm, so there is no signal amplification and the mean signal does not decay.
- Sinkhorn-Knopp algorithm. The projection onto the bi-stochastic manifold is computed using the Sinkhorn-Knopp iterative algorithm, which alternately normalizes rows and columns until convergence. This is computationally cheap (a few iterations suffice) and differentiable, so it integrates seamlessly into backpropagation.
- Signal conservation at every node. The practical effect is that every layer in the network receives inputs of consistent magnitude and produces outputs of consistent magnitude. This eliminates the primary cause of training instability in deep MoE models — the compounding of small per-layer magnitude shifts across hundreds of layers.
The result: DeepSeek trained V4's full 1.6T parameter MoE architecture without the training collapse events that have plagued other attempts at this scale. mHC makes ultra-deep, ultra-wide MoE training reliable rather than heroic.
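The Sinkhorn-Knopp step described above is short enough to sketch in plain Python. This is a generic version of the algorithm, not code from the V4 report. The 3x3 score matrix, the iteration count, and the `mix` helper are illustrative assumptions.

```python
import math

def sinkhorn_knopp(scores, iterations=50):
    """Map a square matrix of real scores to an approximately doubly stochastic matrix."""
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in scores]
    n = len(m)
    for _ in range(iterations):
        # Row normalization: make each row sum to 1.
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        # Column normalization: make each column sum to 1.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m

def mix(matrix, signal):
    """Apply the mixing matrix to a vector of per-stream signals."""
    return [sum(w * s for w, s in zip(row, signal)) for row in matrix]

# ASSUMPTION: arbitrary example scores and signal values.
weights = sinkhorn_knopp([[0.2, 1.5, -0.3], [1.0, 0.1, 0.4], [-0.8, 0.6, 2.0]])
signal = [3.0, -1.0, 2.0]
mixed = mix(weights, signal)

print("row sums:", [round(sum(row), 6) for row in weights])
print("total before and after mixing:", round(sum(signal), 6), round(sum(mixed), 6))
```

Because every operation is a division by a positive sum, the whole projection is differentiable, which is what lets it sit inside a training loop.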
Standard self-attention computes pairwise interactions between all tokens in a sequence, producing O(n²) complexity in both compute and memory. For a 1 million token context window, that is on the order of 10^12 query-key scores per head per layer, which makes naive attention impractical. Previous solutions either keep the quadratic compute while cutting memory traffic (FlashAttention) or limit the attention window (sliding window), but V4's DSA takes a different approach.
DSA compresses attention at the token dimension, not just the head dimension. Most efficient attention methods reduce the number of attention heads or the dimensionality of key/value projections. DSA instead identifies which tokens carry meaningful signal for a given query and compresses the token sequence itself before computing attention scores. Combined with learned sparse patterns that adapt per-layer, this achieves:
- Near-linear scaling with sequence length, reducing the O(n²) bottleneck to approximately O(n log n) in practice
- 60-70% reduction in memory bandwidth requirements during inference, which directly translates to higher throughput on bandwidth-constrained hardware (including Ascend chips)
- No quality degradation on long-context tasks — the compression is learned and adaptive, preserving the tokens that matter for each specific attention computation
DSA is particularly synergistic with the Engram architecture: since static knowledge retrieval happens off-GPU, the attention mechanism only needs to handle the dynamic reasoning context, further reducing the effective sequence length that DSA must process.
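The general idea of compressing along the token dimension can be illustrated with a small block-selection sketch. This is not DSA itself, whose details come from the technical report. It is a generic stand-in in which keys are mean-pooled into block summaries, the query picks the best-scoring blocks, and ordinary attention runs only over tokens in those blocks. Block size, the number of kept blocks, and all function names are assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def block_sparse_attention(query, keys, values, block_size=4, top_blocks=2):
    """Attend only to the tokens inside the top-scoring key blocks."""
    n = len(keys)
    # Compress each block of keys into one summary vector (mean pooling).
    summaries = []
    for start in range(0, n, block_size):
        block = keys[start:start + block_size]
        summaries.append([sum(col) / len(block) for col in zip(*block)])
    # Score the query against the summaries and keep the best few blocks.
    ranked = sorted(range(len(summaries)), key=lambda b: dot(query, summaries[b]), reverse=True)
    kept = sorted(ranked[:top_blocks])
    token_ids = [i for b in kept for i in range(b * block_size, min((b + 1) * block_size, n))]
    # Ordinary scaled dot-product attention, restricted to the kept tokens.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in token_ids])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) for d in range(dim)]
    return output, token_ids

# ASSUMPTION: random toy data, 16 tokens of width 4.
rng = random.Random(0)
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
values = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
query = [rng.gauss(0, 1) for _ in range(4)]

output, attended = block_sparse_attention(query, keys, values)
print(f"attended to {len(attended)} of {len(keys)} tokens")
```

Per query, the work drops from n token scores to n/block_size summary scores plus top_blocks × block_size token scores. Whether quality holds up depends on how the selection is learned, which this sketch does not model.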
V4's pricing is the most immediately impactful aspect for developers and businesses evaluating AI API costs.
These scenarios use realistic token volumes for common production workloads.
Scenario 1: Long-context document analysis (500-page document)
Processing a 500-page document involves approximately 200K input tokens and 10K output tokens for summarization and analysis.
- GPT-4 Turbo: (200K × $10/1M) + (10K × $30/1M) = $2.00 + $0.30 = $2.30
- DeepSeek V4-Pro: (200K × $0.55/1M) + (10K × $2.19/1M) = $0.11 + $0.02 = $0.13
- Cost reduction: 94% cheaper
Scenario 2: AI coding assistant (1.5M output tokens per month)
A development team using an AI coding assistant that generates approximately 1.5M output tokens monthly, with 500K input tokens from code context.
- Claude Opus 4.6: (500K × $15/1M) + (1.5M × $75/1M) = $7.50 + $112.50 = $120.00/month
- DeepSeek V4-Pro: (500K × $0.55/1M) + (1.5M × $2.19/1M) = $0.275 + $3.285 = $3.56/month
- Cost reduction: 97% cheaper
Scenario 3: High-volume customer chatbot (1M queries per month)
A chatbot handling 1M queries monthly, averaging 200 input tokens and 500 output tokens per query (200M input, 500M output tokens total).
- GPT-4 Turbo: (200M × $10/1M) + (500M × $30/1M) = $2,000 + $15,000 = $17,000/month
- DeepSeek V4-Flash: (200M × $0.014/1M) + (500M × $0.28/1M) = $2.80 + $140 = $142.80/month
- Cost reduction: 99% cheaper
These are not cherry-picked comparisons. For virtually any workload, V4 pricing represents a 10x to 100x cost reduction compared to equivalent Western API providers.
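The three scenarios can be reproduced with a few lines of Python, which also makes it easy to plug in your own token volumes. The per-million rates are the ones used in the scenarios above. The dictionary keys are just labels for this sketch, not necessarily valid API model identifiers.

```python
# (input rate, output rate) in USD per 1M tokens, as used in the scenarios above.
PRICES = {
    "gpt-4-turbo": (10.00, 30.00),
    "claude-opus-4.6": (15.00, 75.00),
    "deepseek-v4-pro": (0.55, 2.19),
    "deepseek-v4-flash": (0.014, 0.28),
}

def cost(model, input_tokens, output_tokens):
    """Total USD cost for a given token volume."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# (label, baseline model, candidate model, input tokens, output tokens)
SCENARIOS = [
    ("500-page document", "gpt-4-turbo", "deepseek-v4-pro", 200_000, 10_000),
    ("coding assistant", "claude-opus-4.6", "deepseek-v4-pro", 500_000, 1_500_000),
    ("customer chatbot", "gpt-4-turbo", "deepseek-v4-flash", 200_000_000, 500_000_000),
]

for name, baseline, candidate, tokens_in, tokens_out in SCENARIOS:
    base = cost(baseline, tokens_in, tokens_out)
    cand = cost(candidate, tokens_in, tokens_out)
    print(f"{name}: ${base:,.2f} vs ${cand:,.2f} "
          f"({base / cand:.0f}x, {1 - cand / base:.0%} cheaper)")
```

Under these rates the three scenarios come out at roughly 17x, 34x and 119x. The multiplier depends heavily on the input/output mix, so it is worth running with your own traffic profile.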
One day before V4's public release, Reuters reported that DeepSeek had refused to grant early API access to US chip manufacturers, including Nvidia. According to the report, Nvidia had requested preview access to benchmark V4 on its latest hardware — a standard practice in the AI industry where model developers and chip makers collaborate on optimization. DeepSeek declined, citing "strategic considerations."
This refusal mirrors the US government's October 2022 ban on exporting high-end GPUs to China. The symmetry is deliberate and unmistakable: the US restricted hardware access to Chinese AI labs, and now a Chinese AI lab is restricting software access to US hardware companies. Whether this represents official Chinese government policy or DeepSeek's independent decision, the signal is clear.
The strategic implication extends beyond any single company. If Chinese labs can deploy competitive frontier models on domestic hardware at a fraction of Western pricing, the global AI supply chain is splitting into two parallel and increasingly independent stacks:
- Western stack: Nvidia GPUs → AWS/Azure/GCP cloud → OpenAI/Anthropic/Google models → Western developer ecosystem
- Chinese stack: Huawei Ascend chips → Huawei Cloud/Alibaba Cloud → DeepSeek/Qwen models → Chinese and Global South developer ecosystem
This bifurcation has implications for every company building on AI APIs. Model provenance, data sovereignty, and hardware supply chain dependencies become strategic decisions, not just technical ones.
V4 represents something broader than a single model release. It challenges the prevailing assumption in Western AI labs that frontier capability requires frontier hardware and frontier budgets.
The "scaling laws" narrative — that model quality improves predictably with more compute, more data, and more parameters — implicitly assumed that compute means Nvidia GPUs and budgets mean billions of dollars. DeepSeek has systematically challenged this assumption across multiple model generations:
- V2 demonstrated that MoE architectures could match dense model quality at a fraction of the training compute
- V3 showed that architectural innovation (MLA, DeepSeekMoE) could substitute for raw scale
- V4 proves that the entire stack — from chip-level optimization to novel architectures to production deployment — can operate outside the Nvidia ecosystem entirely
This is a paradigm shift from "scale compute to improve models" to "improve architecture to reduce compute requirements." The implications for AI economics are profound: if intelligence per dollar continues to improve through architectural innovation rather than hardware scaling, the cost of AI capabilities will drop faster than Moore's Law alone would predict.
- Price pressure on Western providers. OpenAI, Anthropic, and Google will face pressure to reduce API pricing. Expect aggressive price cuts within 3-6 months, particularly on high-volume and batch workloads.
- Multi-model routing becomes standard practice. Smart API routing — sending cheap tasks to V4-Flash, complex reasoning to Claude Opus 4.6, and code generation to specialized models — will become table stakes for production AI applications. The cost differential is too large to ignore.
- Compliance and provenance auditing. Enterprise compliance teams will need to audit which models process which data. Regulated industries (finance, healthcare, government) may restrict use of Chinese-origin models regardless of cost advantages, creating a two-tier market.
- Two parallel AI ecosystems. Developers will increasingly need to choose — or bridge — between Western and Chinese AI stacks. Abstraction layers and model-agnostic frameworks will become critical infrastructure.
- Intelligence as commodity infrastructure. AI inference is following the trajectory of CDN bandwidth, cloud storage, and compute instances — rapidly commoditizing toward marginal cost. V4's pricing accelerates this trend by years.
- Open-source fragmentation. As model weights are optimized for different hardware stacks (CUDA vs. CANN), the open-source AI community may fragment along hardware lines. Models that run efficiently on Nvidia may not run efficiently on Ascend, and vice versa.
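The multi-model routing and compliance points in the list above can be combined in a very small routing table. This is a sketch of the pattern only. The task labels, the model assigned to each task, and the `allowed_models` allow-list are assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str
    reason: str

# ASSUMPTION: task labels and model assignments are illustrative.
ROUTES = {
    "chat": Route("deepseek-v4-flash", "high volume, latency-sensitive"),
    "summarize": Route("deepseek-v4-flash", "cheap bulk processing"),
    "code": Route("deepseek-v4-pro", "code generation"),
    "complex_reasoning": Route("claude-opus-4.6", "hardest multi-step tasks"),
}
DEFAULT_ROUTE = Route("deepseek-v4-flash", "fallback to the cheapest tier")

def pick_model(task_type, allowed_models=None):
    """Choose a model for a task, honoring an optional compliance allow-list."""
    route = ROUTES.get(task_type, DEFAULT_ROUTE)
    if allowed_models is not None and route.model not in allowed_models:
        raise PermissionError(f"{route.model} is not approved for this data")
    return route

print(pick_model("chat"))
print(pick_model("complex_reasoning"))
```

Raising on a disallowed model, rather than silently falling back, keeps compliance failures visible. A production router would more likely pick the best approved alternative and log the substitution.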
```bash
curl https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Explain quantum entanglement"}],
    "max_tokens": 1000
  }'
```

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Analyze the trade-offs between microservices and monolithic architectures."},
    ],
    max_tokens=2000,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

- **deepseek-v4-pro**: Flagship model. Agent-optimized with full tool-use support, function calling, and structured output. Best for complex reasoning, multi-step tasks, and code generation.
- **deepseek-v4-flash**: Faster inference, lower cost, 98% of Pro's reasoning quality. Best for high-volume workloads, chatbots, and latency-sensitive applications.
For complex Agent workflows that benefit from extended chain-of-thought reasoning:
```json
{
  "model": "deepseek-v4-pro",
  "messages": [{"role": "user", "content": "Debug this code and suggest fixes..."}],
  "reasoning_mode": true,
  "reasoning_effort": "max"
}
```

V4 weights are available for self-hosting and fine-tuning. Weights are released under the DeepSeek License, which permits commercial use with attribution.
Disclosure: This analysis is based on publicly available information, DeepSeek's technical report, and third-party reporting. The author has no financial relationship with DeepSeek, Huawei, Nvidia, OpenAI, Anthropic, Google, or any other AI provider mentioned in this article. Pricing figures are based on published API rate cards as of April 2026 and may change.