The headline numbers
At MWC Barcelona 2026, Huawei unveiled the Atlas 950 SuperPoD — the company’s next-generation AI infrastructure platform. The numbers demand attention:
- 8,192 Ascend 950DT NPUs in a single logical machine
- 8 EFLOPS in FP8, 16 EFLOPS in FP4
- 16 PB/s interconnect bandwidth (by Huawei’s estimate, more than 10 times the entire globe’s peak internet bandwidth)
- 1,152 TB total memory
- 160 cabinets (128 compute, 32 communications) across 1,000 square meters
- All-optical interconnect via UnifiedBus 2.0
For context: NVIDIA’s planned Vera Rubin NVL144 system connects 144 GPUs. The Atlas 950 SuperPoD has 56.8 times as many processing units and, by Huawei’s account, 6.7 times the computing power. Whether or not you consider the two directly comparable, the scale is unprecedented.
The Atlas 950 SuperPoD is scheduled for Q4 2026.
The Ascend 950DT chip
The SuperPoD is built on the Ascend 950DT, Huawei’s next-generation neural processing unit (NPU), optimized for model training and for the decode stage of inference. Key specs:
- 1 PFLOPS in FP8/MXFP8/HiF8
- 2 PFLOPS in MXFP4
- 144 GB HiZQ 2.0 HBM with 4 TB/s memory access bandwidth
- 2 TB/s interconnect bandwidth (2.5x the Ascend 910C)
- Support for FP8, MXFP8, MXFP4, and Huawei’s proprietary HiF8 format (pitched by Huawei as FP16-class precision at FP8-class efficiency)
The companion chip, Ascend 950PR, is optimized for the prefill stage and recommendation systems, using lower-cost HiBL 1.0 HBM. This split design means you can optimize hardware spend per inference stage — prefill nodes use cheaper memory, decode nodes get maximum bandwidth.
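The headline SuperPoD totals follow directly from these per-chip figures. The check below is a quick sketch, assuming 8,192 chips at the quoted specs. It also assumes binary prefixes for memory and bandwidth (1 TB = 1,024 GB, 1 PB = 1,024 TB), since that is the convention under which the quoted 1,152 TB and 16 PB/s come out exactly. The variable names are illustrative.

```python
# Aggregate SuperPoD figures derived from the per-chip Ascend 950DT specs quoted above.
NPUS = 8192
FP8_PFLOPS_PER_NPU = 1
FP4_PFLOPS_PER_NPU = 2
HBM_GB_PER_NPU = 144
LINK_TBPS_PER_NPU = 2  # TB/s of interconnect bandwidth per chip

fp8_eflops = NPUS * FP8_PFLOPS_PER_NPU / 1000   # decimal prefix: 1 EFLOPS = 1000 PFLOPS
fp4_eflops = NPUS * FP4_PFLOPS_PER_NPU / 1000
memory_tb = NPUS * HBM_GB_PER_NPU / 1024        # assumption: binary prefix
fabric_pbps = NPUS * LINK_TBPS_PER_NPU / 1024   # assumption: binary prefix

print(f"FP8 compute:  {fp8_eflops:.2f} EFLOPS")
print(f"FP4 compute:  {fp4_eflops:.2f} EFLOPS")
print(f"Total memory: {memory_tb:.0f} TB")
print(f"Interconnect: {fabric_pbps:.0f} PB/s")
```

The compute totals round down to the advertised 8 and 16 EFLOPS. The memory and bandwidth totals match the headline numbers only under the binary-prefix assumption.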
The Ascend roadmap
Huawei has committed to a three-year, annual-cadence chip roadmap:
| Chip | Availability | FP8 Compute | Interconnect BW |
|---|---|---|---|
| Ascend 910C | 2025 (shipping) | Baseline | 800 GB/s |
| Ascend 950 series | Q1 2026 (PR), Q4 2026 (DT) | 1 PFLOPS | 2 TB/s |
| Ascend 960 | Q4 2027 | 2 PFLOPS | Higher |
| Ascend 970 | Q4 2028 | 4 PFLOPS | 4 TB/s |
Each generation doubles compute. The goal: sustain AI computing power growth with the semiconductor process nodes available to China.
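As a worked version of the doubling claim, the sketch below regenerates the FP8 column of the table from the 950-series baseline. It also checks the 910C-to-950 interconnect step against the 2.5x figure quoted earlier. The names are illustrative, and the only inputs are the table values.

```python
# Per-chip FP8 compute under the stated "doubles each generation" cadence.
BASE_PFLOPS = 1  # Ascend 950 series, per the table
generations = ["Ascend 950", "Ascend 960", "Ascend 970"]

roadmap = {name: BASE_PFLOPS * 2 ** i for i, name in enumerate(generations)}
for name, pflops in roadmap.items():
    print(f"{name}: {pflops} PFLOPS FP8")

# Interconnect step from the Ascend 910C (800 GB/s) to the 950 series (2 TB/s).
interconnect_gain = 2000 / 800
print(f"910C -> 950 interconnect gain: {interconnect_gain}x")
```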
UnifiedBus 2.0: Not a network — a memory fabric
This is the architectural innovation that makes the SuperPoD fundamentally different from a traditional GPU cluster.
UnifiedBus is not a networking protocol. It is a memory fabric.
In a conventional cluster, GPUs communicate over network fabric (InfiniBand, RoCE). Each GPU has its own memory, and data movement between GPUs requires explicit network operations — serialize, send, receive, deserialize. This adds latency and complexity at every step.
UnifiedBus integrates directly into the processor via a Unified Bus Memory Management Unit (UBMMU). When a processor executes a load instruction against a remote address, the UBMMU translates it into a UB memory operation and sends it over the optical interconnect. The remote side validates the access and returns the data. This happens transparently to the application.
The practical implication: 8,192 NPUs share a unified memory pool. For the software layer, the SuperPoD looks like one very large computer, not a cluster of independent machines. This is closer to how a traditional shared-memory multiprocessor works than how a networked cluster operates.
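To make the programming-model difference concrete, here is a toy model of a global address space. It is not Huawei’s implementation. `ToyFabric`, `ToyNPU`, and the contiguous address layout are assumptions made purely for illustration, and the real UBMMU does this translation in hardware over the optical links. The point is only that the caller issues a plain load or store against one address, and the fabric decides which device owns it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyNPU:
    """One device with its own local memory (a plain bytearray here)."""
    npu_id: int
    memory: bytearray


@dataclass
class ToyFabric:
    """Hypothetical stand-in for the address translation a UBMMU performs."""
    bytes_per_npu: int
    npus: List[ToyNPU] = field(default_factory=list)

    def add_npu(self) -> ToyNPU:
        npu = ToyNPU(len(self.npus), bytearray(self.bytes_per_npu))
        self.npus.append(npu)
        return npu

    def translate(self, global_addr: int) -> Tuple[int, int]:
        """Map a global address to (owning NPU, offset in its local memory)."""
        npu_id, offset = divmod(global_addr, self.bytes_per_npu)
        if not 0 <= npu_id < len(self.npus):
            raise IndexError(f"address {global_addr:#x} is outside the pool")
        return npu_id, offset

    def load(self, global_addr: int) -> int:
        npu_id, offset = self.translate(global_addr)
        return self.npus[npu_id].memory[offset]

    def store(self, global_addr: int, value: int) -> None:
        npu_id, offset = self.translate(global_addr)
        self.npus[npu_id].memory[offset] = value


fabric = ToyFabric(bytes_per_npu=1024)
for _ in range(4):
    fabric.add_npu()

addr = 3 * 1024 + 5            # falls inside the fourth device's range
fabric.store(addr, 42)         # caller never names the target device
print(fabric.translate(addr), fabric.load(addr))
```

In a network-fabric design, the equivalent of `store` would be an explicit send on one device matched by a receive on another. Here the routing decision is hidden behind the address.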
Key interconnect specs
- 16 PB/s total interconnect bandwidth across the full SuperPoD
- 2.1 microsecond inter-NPU latency
- 200 meter optical range within the data center
- A claimed 100x reliability improvement over conventional optical interconnect, with 100 ns fault detection and protection switching
- All-optical links, which Huawei pitches as combining “copper reliability with optical range”
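A rough way to read the latency and bandwidth figures together is a first-order cost model: transfer time is the fixed latency plus the payload size divided by bandwidth. The sketch below assumes the 2.1 microsecond latency applies per transfer. It takes the per-NPU link as the 2 TB/s quoted in the chip specs, read as 2e12 bytes/s, and ignores contention and protocol overhead. It is an estimate under those assumptions, not a measured result.

```python
LATENCY_S = 2.1e-6      # quoted inter-NPU latency
BANDWIDTH_BPS = 2e12    # quoted 2 TB/s per-NPU interconnect, taken as 2e12 bytes/s


def transfer_time(num_bytes: int) -> float:
    """First-order estimate: fixed latency plus time on the wire."""
    return LATENCY_S + num_bytes / BANDWIDTH_BPS


def crossover_bytes() -> float:
    """Message size at which the bandwidth term equals the latency term."""
    return LATENCY_S * BANDWIDTH_BPS


for size in (64, 4096, 1 << 20, 1 << 30):
    print(f"{size:>12} B: {transfer_time(size) * 1e6:12.2f} us")
print(f"crossover: {crossover_bytes() / 1e6:.1f} MB")
```

Under this model the crossover sits at about 4.2 MB. Smaller transfers are dominated by latency, which is why the latency figure matters as much as the headline bandwidth for fine-grained remote loads.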
UnifiedBus 2.0 is open
In a move that surprised many observers, Huawei has open-sourced the UnifiedBus 2.0 technical specifications. Where NVIDIA’s NVLink remains proprietary and tightly coupled to NVIDIA hardware, Huawei is inviting industry partners to adopt UnifiedBus and develop compatible products.
The stated strategy: monetize hardware (chip sales), not software or protocols. This is a deliberate trade of short-term lock-in for ecosystem growth.
The software stack
Hardware without software is expensive metal. Here is how the Atlas 950 SuperPoD is programmed:
CANN: Huawei’s CUDA equivalent
CANN (Compute Architecture for Neural Networks) is the low-level compute framework. Version 8.0 is the current release. Huawei has committed to:
- Open-sourcing operator libraries, acceleration libraries, graph engines, and programming languages
- Full open source and open access for CANN by end of 2025 (based on Ascend 910B/910C)
- Synchronizing open source plans for future versions with product launches
CANN supports PyTorch via torch_npu — a backend plugin using PyTorch’s PrivateUse1 mechanism. You can take existing PyTorch code and run it on Ascend hardware with minimal changes. CANN also supports vLLM, SGLang, xLLM, verl, Triton, and TileLang.
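A minimal sketch of what “minimal changes” usually amounts to: select the `npu` device when the plugin is present and fall back to CPU otherwise. It assumes that importing `torch_npu` registers the `npu` device and makes `torch.npu.is_available()` callable, as the plugin’s documentation describes. `pick_device` is a hypothetical helper, not part of PyTorch or `torch_npu`.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu" when torch_npu is usable and an NPU is visible, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch itself is not installed
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"  # plugin not installed
    try:
        import torch
        import torch_npu  # noqa: F401  (assumption: import registers the "npu" device)
        if torch.npu.is_available():
            return "npu"
    except Exception:
        # Plugin present but unusable, e.g. the CANN runtime is missing.
        pass
    return "cpu"


device = pick_device()
print(f"selected device: {device}")
```

Existing code then passes the result wherever it previously hard-coded `"cuda"`, for example `model.to(device)` or `torch.ones(2, 2, device=device)`.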
Is CANN as mature as CUDA? No — CUDA has 15+ years of ecosystem development. But the gap is narrowing, and the open-source commitment accelerates adoption.
openFuyao: Kubernetes for the SuperPoD
openFuyao is the cluster orchestration layer that runs on top of the SuperPoD. It provides:
- NUMA-aware scheduling — critical when NPUs have non-uniform memory access patterns
- Ultra-large cluster scheduling — optimized for 10,000+ node clusters
- NPU Operator — manages Ascend NPUs as Kubernetes resources, with fractional allocation
- KAE Operator — integrates Kunpeng Acceleration Engine hardware
- AI inference suite — KVCache optimization, intelligent routing, cache hit strategies
- Colocation scheduling — mixes online and offline workloads for a claimed 30% improvement in CPU utilization
openFuyao is to the Atlas SuperPoD what OpenShift is to a Red Hat infrastructure stack — the platform layer that makes the hardware accessible to application teams.
openEuler: The OS layer
openEuler provides the operating system, including:
- kubeOS — containerized, immutable OS for Kubernetes nodes
- Multi-architecture support (ARM/Kunpeng, x86, RISC-V)
- Multi-kernel architecture (Linux + UniProton RTOS)
- 16 million+ installations, governed by the OpenAtom Foundation
The full stack: openEuler (OS) → openFuyao (Kubernetes/scheduling) → CANN (compute framework) → Ascend NPUs (hardware) → UnifiedBus (interconnect).
Beyond the SuperPoD: SuperClusters
Huawei is not stopping at 8,192 NPUs:
Atlas 950 SuperCluster (Q4 2026)
- 64 Atlas 950 SuperPoDs combined
- 520,000+ Ascend 950DT NPUs
- 524 EFLOPS in FP8
- 10,000+ cabinets
- Supports both UBoE (UnifiedBus over Ethernet) and RoCE protocols
For scale: by Huawei’s figures, the Atlas 950 SuperCluster would have about 2.5 times as many processing units as xAI’s Colossus, currently the world’s largest computing cluster, and more computing power.
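The SuperCluster figures are the SuperPoD figures multiplied out. A quick check, assuming 64 SuperPoDs with the per-SuperPoD numbers quoted earlier (8,192 NPUs, 160 cabinets, 1 PFLOPS FP8 per NPU); the names are illustrative:

```python
SUPERPODS = 64
NPUS_PER_SUPERPOD = 8192
CABINETS_PER_SUPERPOD = 160
FP8_PFLOPS_PER_NPU = 1

total_npus = SUPERPODS * NPUS_PER_SUPERPOD
total_cabinets = SUPERPODS * CABINETS_PER_SUPERPOD
fp8_eflops = total_npus * FP8_PFLOPS_PER_NPU / 1000  # 1 EFLOPS = 1000 PFLOPS

print(f"NPUs:     {total_npus:,}")
print(f"Cabinets: {total_cabinets:,}")
print(f"FP8:      {fp8_eflops:.0f} EFLOPS")
```

This gives 524,288 NPUs, 10,240 cabinets, and roughly 524 EFLOPS, consistent with the "520,000+", "10,000+", and 524 EFLOPS figures above.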
Atlas 960 SuperCluster (Q4 2027)
- 1 million+ NPUs
- 2 ZFLOPS in FP8, 4 ZFLOPS in FP4
- Built on Ascend 960 chips
These are designed for training models with over 1 trillion parameters and the next generation of physical AI systems.
The performance claims in context
Huawei claims the Atlas 950 SuperPoD achieves 95% compute efficiency across 8,192 NPUs. If accurate, this is remarkable — most large GPU clusters see significant efficiency degradation beyond a few hundred GPUs due to communication overhead.
The claimed training throughput is 4.91 million tokens per second, a 17x improvement over the Atlas 900 A3. Claimed inference throughput with FP4 is 19.6 million tokens per second, a 26.5x improvement.
These numbers need independent validation. But even at 70-80% of claimed performance, the system would represent a significant capability.
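To put rough numbers on that discount, the sketch below scales the claimed throughput by 70%, 80%, and 100%. It also backs out the Atlas 900 A3 baselines implied by the quoted speedups. Those baselines are derived from the claims above, not independently sourced figures.

```python
TRAIN_TOKENS_PER_S = 4.91e6   # claimed training throughput
TRAIN_SPEEDUP = 17            # claimed gain over the Atlas 900 A3
INFER_TOKENS_PER_S = 19.6e6   # claimed FP4 inference throughput
INFER_SPEEDUP = 26.5

# Baselines implied by the quoted ratios (derived, not measured).
implied_a3_train = TRAIN_TOKENS_PER_S / TRAIN_SPEEDUP
implied_a3_infer = INFER_TOKENS_PER_S / INFER_SPEEDUP
print(f"implied A3 training:  {implied_a3_train:,.0f} tokens/s")
print(f"implied A3 inference: {implied_a3_infer:,.0f} tokens/s")

for fraction in (0.7, 0.8, 1.0):
    train = TRAIN_TOKENS_PER_S * fraction
    infer = INFER_TOKENS_PER_S * fraction
    print(f"{fraction:.0%} of claim: train {train / 1e6:.2f}M, "
          f"infer {infer / 1e6:.2f}M tokens/s, "
          f"{train / implied_a3_train:.1f}x over implied A3 training")
```

Even at 70% of the claim, training throughput would be about 3.4 million tokens per second, or roughly 12 times the implied A3 baseline.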
What this means for AI infrastructure strategy
The end of the single-vendor assumption
For years, “AI infrastructure” meant “NVIDIA.” The Atlas 950 SuperPoD demonstrates that this assumption is increasingly outdated. Whether or not you deploy Huawei hardware, the existence of a competitive alternative changes the market dynamics:
- Pricing pressure on NVIDIA hardware and InfiniBand networking
- Architectural diversity — different approaches to the same problem (memory fabric vs. network fabric)
- Supply chain resilience — a second source for AI compute at scale
- Open interconnect standards — UnifiedBus 2.0 being open-sourced pressures NVLink’s proprietary model
Memory fabric vs. network fabric
The deepest architectural difference is UnifiedBus’s memory fabric approach versus NVIDIA’s network fabric approach. Memory fabric makes large-scale training easier to program (unified address space), but it is a fundamentally different model that requires different software optimization.
For teams evaluating AI infrastructure, the question is not “which chip is faster” but “which architecture better matches our workload patterns.” Models that benefit from massive parameter sharing across devices (trillion-parameter training) may favor the memory fabric approach. Models that can be efficiently partitioned may work better on traditional network-connected GPU clusters.
The open ecosystem bet
Huawei’s decision to open-source UnifiedBus, CANN, and the rest of its software stack is a strategic choice: grow the ecosystem faster than it could grow alone. This is the same playbook that made Linux, Kubernetes, and Android successful.
If UnifiedBus gains adoption beyond Huawei hardware, it could become a standard for high-performance interconnects — similar to how Ethernet displaced proprietary networking protocols in the datacenter.
The full Huawei AI infrastructure stack
| Layer | Component | Role |
|---|---|---|
| Application | MindSpore, PyTorch (torch_npu) | Model development |
| Platform | openFuyao | Kubernetes cluster orchestration |
| Compute framework | CANN 8.0 | NPU programming (CUDA equivalent) |
| OS | openEuler | Enterprise Linux |
| Interconnect | UnifiedBus 2.0 | Memory fabric protocol |
| Networking | Xinghe AI Fabric 2.0 | Ethernet-based data center networking |
| Storage | OceanStor A800/A600 | AI-optimized storage with UCM KV cache |
| Compute | Ascend 950DT/950PR NPUs | Neural processing units |
| System | Atlas 950 SuperPoD | 8,192 NPU single logical machine |
This is one of the few truly full-stack AI infrastructure offerings in the world — from chip to application framework, controlled by a single vendor but increasingly open-sourced.
Storage: The overlooked layer
Huawei’s OceanStor deserves separate attention:
- OceanStor A800: 500 GB/s bandwidth, 24M IOPS per 8U enclosure. Ranked first in MLPerf Storage v2.0 benchmarks.
- OceanStor A600: Supports vectors, tensors, and KV cache natively. Claims a 78% reduction in time to first token (TTFT).
- UCM (Unified Cache Manager): Three-tier KV cache management across L1 (accelerator HBM), L2 (host DRAM), and L3 (SSD/NVMe). Claims a 90% TTFT reduction and a 10x inference throughput improvement for long-sequence scenarios.
Storage is often the bottleneck in AI training pipelines. Having storage that natively understands AI data patterns (tensors, KV cache) rather than treating everything as files is a meaningful architectural advantage.
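The three-tier idea can be illustrated with a toy cache. This is not UCM’s implementation. The class name, the LRU eviction policy, and the promote-on-hit rule are all assumptions chosen to show the general technique: hot entries stay in the fastest tier, and overflow is demoted one tier down instead of being discarded.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy tiered LRU cache: hits promote to tier 0, overflow demotes one tier."""

    def __init__(self, capacities):
        # One LRU-ordered dict per tier, fastest first (e.g. HBM, DRAM, SSD).
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy in any tier
        self._insert(0, key, value)

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: evicted entirely
        tier = self.tiers[level]
        tier[key] = value  # new keys land at the most-recently-used end
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)

    def get(self, key):
        """Return (value, tier_index_where_found) on a hit, or None on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier
                return value, level
        return None


cache = TieredKVCache([2, 2, 4])
for prefix in ("a", "b", "c"):
    cache.put(prefix, prefix.upper())  # "a" is demoted to tier 1 when "c" arrives

print(cache.get("a"))  # found in tier 1, then promoted
print(cache.get("a"))  # now served from tier 0
```

A hit in a lower tier costs a slower read but avoids recomputing the prefill, which is the mechanism behind the TTFT reductions claimed above.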
Looking ahead
The Atlas 950 SuperPoD arrives Q4 2026. The Atlas 960 SuperPoD (15,488 NPUs, 30 EFLOPS FP8) follows in Q4 2027. The million-NPU SuperCluster is on the same timeline.
Whether these systems deliver on their promises remains to be seen. But the architectural ambition — memory fabric interconnect, open protocols, full-stack integration — represents a genuinely different approach to AI infrastructure than what the West is building.
For platform engineers and infrastructure architects, understanding this stack is no longer optional. It is part of the landscape.