Huawei Atlas 950 AI SuperPoD: 8,192 NPUs as One Machine

Huawei Atlas 950 SuperPoD connects 8,192 Ascend NPUs via UnifiedBus 2.0 into one machine delivering 8 EFLOPS. Full architecture from chips to openFuyao.

Luca Berton
· 8 min read

The headline numbers

At MWC Barcelona 2026, Huawei unveiled the Atlas 950 SuperPoD — the company’s next-generation AI infrastructure platform. The numbers demand attention:

  • 8,192 Ascend 950DT NPUs in a single logical machine
  • 8 EFLOPS in FP8, 16 EFLOPS in FP4
  • 16 PB/s interconnect bandwidth (by Huawei’s account, more than 10x global peak internet bandwidth)
  • 1,152 TB total memory
  • 160 cabinets (128 compute, 32 communications) across 1,000 square meters
  • All-optical interconnect via UnifiedBus 2.0

For context: NVIDIA’s planned Vera Rubin NVL144 system connects 144 GPU dies. By that count, the Atlas 950 SuperPoD has about 57 times as many processing units (8,192 / 144 ≈ 56.9) and claims 6.7 times the computing power. Whether or not you view the two as directly comparable, the scale is unprecedented.

The Atlas 950 SuperPoD is scheduled for Q4 2026.

The Ascend 950DT chip

The SuperPoD is built on the Ascend 950DT — Huawei’s next-generation neural processing unit optimized for the decode stage of inference and model training. Key specs:

  • 1 PFLOPS in FP8/MXFP8/HiF8
  • 2 PFLOPS in MXFP4
  • 144 GB HiZQ 2.0 HBM with 4 TB/s memory access bandwidth
  • 2 TB/s interconnect bandwidth (2.5x the Ascend 910C)
  • Support for FP8, MXFP8, MXFP4, and Huawei’s proprietary HiF8 format (FP16-class precision at FP8-class efficiency)
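The system-level headline numbers can be cross-checked against these per-chip specs. The sketch below assumes the SuperPoD totals are plain multiples of the per-NPU figures and that the memory and bandwidth totals are rounded with binary prefixes (1 PB = 1,024 TB). Both are my assumptions, not something Huawei states.

```python
# Cross-check of the SuperPoD headline figures from the per-chip specs.
# Assumption: totals are simple multiples of per-NPU values.
NPUS = 8192
FP8_PFLOPS_PER_NPU = 1
FP4_PFLOPS_PER_NPU = 2
HBM_GB_PER_NPU = 144
LINK_TB_PER_S_PER_NPU = 2

fp8_eflops = NPUS * FP8_PFLOPS_PER_NPU / 1000        # decimal prefix: PFLOPS -> EFLOPS
fp4_eflops = NPUS * FP4_PFLOPS_PER_NPU / 1000
memory_tb = NPUS * HBM_GB_PER_NPU / 1024             # binary prefix: GB -> TB
bandwidth_pb_per_s = NPUS * LINK_TB_PER_S_PER_NPU / 1024  # binary prefix: TB/s -> PB/s

print(f"FP8: {fp8_eflops:.3f} EFLOPS, FP4: {fp4_eflops:.3f} EFLOPS")
print(f"Memory: {memory_tb:.0f} TB, interconnect: {bandwidth_pb_per_s:.0f} PB/s")
```

Under those assumptions the per-chip specs reproduce the 8 EFLOPS, 16 EFLOPS, 1,152 TB, and 16 PB/s figures quoted above.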

The companion chip, Ascend 950PR, is optimized for the prefill stage and recommendation systems, using lower-cost HiBL 1.0 HBM. This split design means you can optimize hardware spend per inference stage — prefill nodes use cheaper memory, decode nodes get maximum bandwidth.
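Why would decode nodes want the most memory bandwidth? A common back-of-the-envelope argument is that each decoded token has to stream the model weights from HBM, so bandwidth caps the single-stream token rate. The sketch below applies that argument to the 950DT's 4 TB/s figure. The 70-billion-parameter model, the one-byte-per-parameter FP8 weights, and the decision to ignore KV cache traffic and batching are all illustrative assumptions, not numbers from Huawei.

```python
def decode_tokens_per_second_bound(params_billion: float,
                                   bytes_per_param: float,
                                   hbm_tb_per_s: float) -> float:
    """Upper bound on single-stream decode rate when every token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_tb_per_s * 1e12 / weight_bytes


# Hypothetical 70B-parameter model stored in FP8 (1 byte per parameter).
bound = decode_tokens_per_second_bound(params_billion=70, bytes_per_param=1, hbm_tb_per_s=4)
print(f"bandwidth-bound decode rate: about {bound:.0f} tokens/s per stream")
```

Prefill processes many prompt tokens per weight read, so it tends to be limited by compute rather than bandwidth, which is the usual rationale for giving it cheaper memory.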

The Ascend roadmap

Huawei has committed to a three-year, annual-cadence chip roadmap:

| Chip | Availability | FP8 Compute | Interconnect BW |
|---|---|---|---|
| Ascend 910C | 2025 (shipping) | Baseline | 800 GB/s |
| Ascend 950 series | Q1 2026 (PR), Q4 2026 (DT) | 1 PFLOPS | 2 TB/s |
| Ascend 960 | Q4 2027 | 2 PFLOPS | Higher |
| Ascend 970 | Q4 2028 | 4 PFLOPS | 4 TB/s |

From the 950 series onward, each generation doubles FP8 compute (1, 2, then 4 PFLOPS per chip). The goal: sustain AI computing power growth with the semiconductor process nodes available to China.

UnifiedBus 2.0: Not a network — a memory fabric

This is the architectural innovation that makes the SuperPoD fundamentally different from a traditional GPU cluster.

UnifiedBus is not a networking protocol. It is a memory fabric.

In a conventional cluster, GPUs communicate over network fabric (InfiniBand, RoCE). Each GPU has its own memory, and data movement between GPUs requires explicit network operations — serialize, send, receive, deserialize. This adds latency and complexity at every step.

UnifiedBus integrates directly into the processor via a Unified Bus Memory Management Unit (UBMMU). When a processor executes a load instruction against a remote address, the UBMMU translates it into a UB memory operation and sends it over the optical interconnect. The remote side validates the access and returns the data. This happens transparently to the application.

The practical implication: 8,192 NPUs share a unified memory pool. For the software layer, the SuperPoD looks like one very large computer, not a cluster of independent machines. This is closer to how a traditional shared-memory multiprocessor works than how a networked cluster operates.
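To make the idea concrete, here is a toy model of a unified address space in Python. It is not the UnifiedBus protocol or the UBMMU's real address layout, neither of which is described here. The `ToyFabric` class, its linear address partitioning, and the bounds check standing in for access validation are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFabric:
    """Toy unified address space: one flat range split evenly across nodes."""
    num_nodes: int
    bytes_per_node: int
    memory: list = field(init=False)

    def __post_init__(self):
        # Each node owns its own local memory; the fabric only maps addresses.
        self.memory = [bytearray(self.bytes_per_node) for _ in range(self.num_nodes)]

    def translate(self, global_addr: int) -> tuple:
        """Map a global address to (owning node, local offset)."""
        if not 0 <= global_addr < self.num_nodes * self.bytes_per_node:
            raise IndexError("address outside the unified pool")  # stand-in for access validation
        return divmod(global_addr, self.bytes_per_node)

    def store(self, global_addr: int, value: int) -> None:
        node, offset = self.translate(global_addr)
        self.memory[node][offset] = value

    def load(self, global_addr: int) -> int:
        # The caller never names a node; the "remote" access looks like a local load.
        node, offset = self.translate(global_addr)
        return self.memory[node][offset]


fabric = ToyFabric(num_nodes=4, bytes_per_node=1024)
fabric.store(3000, 42)
print(fabric.translate(3000), fabric.load(3000))
```

The point of the sketch is the calling convention: application code issues `load` and `store` against one address range and never serializes or sends anything itself.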

Key interconnect specs

  • 16 PB/s total interconnect bandwidth across the full SuperPoD
  • 2.1 microsecond inter-NPU latency
  • 200 meter optical range within the data center
  • 100x more reliable than conventional optical interconnect (100-ns fault detection and protection switching)
  • 100% all-optical — combines “copper reliability with optical range”
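These figures can be combined into a rough transfer-time estimate using the standard latency-plus-bandwidth model. The sketch assumes a single transfer can use the full 2 TB/s per-NPU bandwidth and that the 2.1 microsecond latency applies per message. Both are simplifications of mine, and real behavior will depend on topology and contention.

```python
LATENCY_S = 2.1e-6          # quoted inter-NPU latency
LINK_BYTES_PER_S = 2e12     # quoted per-NPU interconnect bandwidth (2 TB/s)


def transfer_time_s(message_bytes: float) -> float:
    """Latency plus wire time for one message (simple alpha-beta model)."""
    return LATENCY_S + message_bytes / LINK_BYTES_PER_S


def half_peak_message_bytes() -> float:
    """Message size at which wire time equals latency, i.e. half of peak throughput."""
    return LATENCY_S * LINK_BYTES_PER_S


for size in (4 * 1024, 1024 ** 2, 1024 ** 3):
    print(f"{size:>12,d} bytes: {transfer_time_s(size) * 1e6:10.2f} us")
print(f"half-peak message size: {half_peak_message_bytes() / 1e6:.1f} MB")
```

Under this model, messages smaller than a few megabytes are dominated by latency rather than bandwidth, which is why the latency figure matters as much as the headline bandwidth.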

UnifiedBus 2.0 is open

In a move that surprised many observers, Huawei has open-sourced the UnifiedBus 2.0 technical specifications. Where NVIDIA’s NVLink remains proprietary and tightly coupled to NVIDIA hardware, Huawei is inviting industry partners to adopt UnifiedBus and develop compatible products.

The stated strategy: monetize hardware (chip sales), not software or protocols. This is a deliberate trade of short-term lock-in for ecosystem growth.

The software stack

Hardware without software is expensive metal. Here is how the Atlas 950 SuperPoD is programmed:

CANN: Huawei’s CUDA equivalent

CANN (Compute Architecture for Neural Networks) is the low-level compute framework. Version 8.0 is the current release. Huawei has committed to:

  • Open-sourcing operator libraries, acceleration libraries, graph engines, and programming languages
  • Full open source and open access for CANN by end of 2025 (based on Ascend 910B/910C)
  • Synchronizing open source plans for future versions with product launches

CANN supports PyTorch via torch_npu — a backend plugin using PyTorch’s PrivateUse1 mechanism. You can take existing PyTorch code and run it on Ascend hardware with minimal changes. CANN also supports vLLM, SGLang, xLLM, verl, Triton, and TileLang.
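As an illustration of what "minimal changes" can look like, here is a small device-selection sketch. It assumes the `torch_npu` package is installed alongside a working CANN runtime on Ascend hosts, and it falls back to CUDA or CPU elsewhere. The `pick_device` helper is a hypothetical name of mine, not part of any library.

```python
def pick_device() -> str:
    """Return the best available device string: Ascend NPU, then CUDA, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch itself is missing on this host

    try:
        import torch_npu  # noqa: F401  (importing it registers the "npu" backend)
        if torch.npu.is_available():
            return "npu:0"
    except Exception:
        pass  # torch_npu or the CANN runtime is not usable on this host

    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


device = pick_device()
print(f"running on {device}")
```

With the device string in hand, the usual `model.to(device)` and `tensor.to(device)` calls are typically the only device-specific lines left in a training or inference script.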

Is CANN as mature as CUDA? No — CUDA has 15+ years of ecosystem development. But the gap is narrowing, and the open-source commitment accelerates adoption.

openFuyao: Kubernetes for the SuperPoD

openFuyao is the cluster orchestration layer that runs on top of the SuperPoD. It provides:

  • NUMA-aware scheduling — critical when NPUs have non-uniform memory access patterns
  • Ultra-large cluster scheduling — optimized for 10,000+ node clusters
  • NPU Operator — manages Ascend NPUs as Kubernetes resources, with fractional allocation
  • KAE Operator — integrates Kunpeng Acceleration Engine hardware
  • AI inference suite — KVCache optimization, intelligent routing, cache hit strategies
  • Colocation scheduling — mixes online and offline workloads for 30% better CPU utilization

openFuyao is to the Atlas SuperPoD what OpenShift is to a Red Hat infrastructure stack — the platform layer that makes the hardware accessible to application teams.

openEuler: The OS layer

openEuler provides the operating system, including:

  • kubeOS — containerized, immutable OS for Kubernetes nodes
  • Multi-architecture support (ARM/Kunpeng, x86, RISC-V)
  • Multi-kernel architecture (Linux + UniProton RTOS)
  • 16 million+ installations, governed by the OpenAtom Foundation

The full stack: openEuler (OS) → openFuyao (Kubernetes/scheduling) → CANN (compute framework) → Ascend NPUs (hardware) → UnifiedBus (interconnect).

Beyond the SuperPoD: SuperClusters

Huawei is not stopping at 8,192 NPUs:

Atlas 950 SuperCluster (Q4 2026)

  • 64 Atlas 950 SuperPoDs combined
  • 520,000+ Ascend 950DT NPUs
  • 524 EFLOPS in FP8
  • 10,000+ cabinets
  • Supports both UBoE (UnifiedBus over Ethernet) and RoCE protocols
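The SuperCluster figures are consistent with 64 copies of the SuperPoD. The check below assumes linear scaling of NPUs, FP8 compute, and cabinets, with no extra cabinets for inter-SuperPoD networking, which is my simplification.

```python
SUPERPODS = 64
NPUS_PER_SUPERPOD = 8192
CABINETS_PER_SUPERPOD = 160
FP8_PFLOPS_PER_NPU = 1

total_npus = SUPERPODS * NPUS_PER_SUPERPOD                   # matches "520,000+"
total_fp8_eflops = total_npus * FP8_PFLOPS_PER_NPU / 1000    # matches "524 EFLOPS"
total_cabinets = SUPERPODS * CABINETS_PER_SUPERPOD           # matches "10,000+"

print(f"{total_npus:,} NPUs, {total_fp8_eflops:.0f} EFLOPS FP8, {total_cabinets:,} cabinets")
```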

For scale: by Huawei’s comparison, the Atlas 950 SuperCluster would have about 2.5 times as many processing units as xAI’s Colossus, currently the world’s largest computing cluster, and more computing power.

Atlas 960 SuperCluster (Q4 2027)

  • 1 million+ NPUs
  • 2 ZFLOPS in FP8, 4 ZFLOPS in FP4
  • Built on Ascend 960 chips

These are designed for training models with over 1 trillion parameters and the next generation of physical AI systems.

The performance claims in context

Huawei claims the Atlas 950 SuperPoD achieves 95% compute efficiency across 8,192 NPUs. If accurate, this is remarkable — most large GPU clusters see significant efficiency degradation beyond a few hundred GPUs due to communication overhead.

The claimed training throughput: 4.91 million tokens per second (17x improvement over the Atlas 900 A3). Inference throughput with FP4: 19.6 million tokens per second (26.5x improvement).

These numbers need independent validation. But even at 70-80% of claimed performance, the system would represent a significant capability.
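To see what a 70-80% discount would leave, the claimed figures can be scaled directly. The sketch assumes both speedups are measured against the same Atlas 900 A3 baseline and that a shortfall reduces throughput and speedup proportionally. Those are my assumptions for the sake of the arithmetic.

```python
# (claimed tokens per second, claimed speedup over the previous generation)
CLAIMS = {
    "training": (4.91e6, 17.0),
    "inference_fp4": (19.6e6, 26.5),
}


def discounted(tokens_per_s: float, speedup: float, fraction: float) -> tuple:
    """Scale a claimed throughput and its speedup by `fraction` of the claim."""
    return tokens_per_s * fraction, speedup * fraction


for name, (tps, speedup) in CLAIMS.items():
    baseline = tps / speedup  # implied throughput of the older system
    for fraction in (0.7, 0.8):
        d_tps, d_speedup = discounted(tps, speedup, fraction)
        print(f"{name}: implied baseline {baseline:,.0f} tok/s; "
              f"at {fraction:.0%} of claim: {d_tps:,.0f} tok/s ({d_speedup:.1f}x)")
```

Even at the low end of that range, the training figure would still be roughly a 12x generational improvement.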

What this means for AI infrastructure strategy

The end of the single-vendor assumption

For years, “AI infrastructure” meant “NVIDIA.” The Atlas 950 SuperPoD demonstrates that this assumption is increasingly outdated. Whether or not you deploy Huawei hardware, the existence of a competitive alternative changes the market dynamics:

  • Pricing pressure on NVIDIA hardware and InfiniBand networking
  • Architectural diversity — different approaches to the same problem (memory fabric vs. network fabric)
  • Supply chain resilience — a second source for AI compute at scale
  • Open interconnect standards — UnifiedBus 2.0 being open-sourced pressures NVLink’s proprietary model

Memory fabric vs. network fabric

The deepest architectural difference is UnifiedBus’s memory fabric approach versus NVIDIA’s network fabric approach. Memory fabric makes large-scale training easier to program (unified address space), but it is a fundamentally different model that requires different software optimization.

For teams evaluating AI infrastructure, the question is not “which chip is faster” but “which architecture better matches our workload patterns.” Models that benefit from massive parameter sharing across devices (trillion-parameter training) may favor the memory fabric approach. Models that can be efficiently partitioned may work better on traditional network-connected GPU clusters.

The open ecosystem bet

Huawei’s decision to open-source UnifiedBus, CANN, and their software stack is a strategic choice: grow the ecosystem faster than you could grow alone. This is the same playbook that made Linux, Kubernetes, and Android successful.

If UnifiedBus gains adoption beyond Huawei hardware, it could become a standard for high-performance interconnects — similar to how Ethernet displaced proprietary networking protocols in the datacenter.

The full Huawei AI infrastructure stack

| Layer | Component | Role |
|---|---|---|
| Application | MindSpore, PyTorch (torch_npu) | Model development |
| Platform | openFuyao | Kubernetes cluster orchestration |
| Compute framework | CANN 8.0 | NPU programming (CUDA equivalent) |
| OS | openEuler | Enterprise Linux |
| Interconnect | UnifiedBus 2.0 | Memory fabric protocol |
| Networking | Xinghe AI Fabric 2.0 | Ethernet-based data center networking |
| Storage | OceanStor A800/A600 | AI-optimized storage with UCM KV cache |
| Compute | Ascend 950DT/950PR NPUs | Neural processing units |
| System | Atlas 950 SuperPoD | 8,192 NPU single logical machine |

This is one of the few truly full-stack AI infrastructure offerings in the world — from chip to application framework, controlled by a single vendor but increasingly open-sourced.

Storage: The overlooked layer

Huawei’s OceanStor deserves separate attention:

  • OceanStor A800: 500 GB/s bandwidth, 24M IOPS per 8U enclosure. Ranked first in MLPerf Storage v2.0 benchmarks.
  • OceanStor A600: Supports vectors, tensors, and KV cache natively. Claims 78% reduction in time to first token (TTFT).
  • UCM (Unified Cache Manager): Three-tier KV cache management — L1 (GPU HBM), L2 (Host DRAM), L3 (SSD/NVMe). Claims 90% TTFT reduction and 10x inference throughput improvement for long-sequence scenarios.
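The three-tier idea can be sketched as a small cascade of LRU caches. This is a toy illustration of tiered lookup with promotion and demotion, not UCM's actual design or API. The `TieredKVCache` class, its capacities, and its eviction policy are all hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: hits are promoted to tier 1, LRU entries are demoted."""

    def __init__(self, capacities=(2, 4, 8)):
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in capacities]  # index 0 = fastest tier

    def put(self, key, value, tier=0):
        if tier >= len(self.tiers):
            return  # evicted from the last tier: the entry is dropped
        cache = self.tiers[tier]
        cache[key] = value
        cache.move_to_end(key)  # mark as most recently used
        if len(cache) > self.capacities[tier]:
            old_key, old_value = cache.popitem(last=False)  # least recently used
            self.put(old_key, old_value, tier + 1)          # demote one tier down

    def get(self, key):
        """Return (value, tier number where it was found), or (None, None) on a miss."""
        for level, cache in enumerate(self.tiers):
            if key in cache:
                value = cache.pop(key)
                self.put(key, value)  # promote to the fastest tier
                return value, level + 1
        return None, None


cache = TieredKVCache()
for name in ("a", "b", "c"):
    cache.put(name, f"kv-{name}")   # "a" is demoted to tier 2 when "c" arrives

print(cache.get("a"))   # found in tier 2, then promoted
print(cache.get("a"))   # now found in tier 1
print(cache.get("zzz")) # miss
```

A miss in all three tiers is the case that forces a full prefill recompute, which is why a larger, slower third tier can still reduce TTFT for long or repeated prompts.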

Storage is often the bottleneck in AI training pipelines. Having storage that natively understands AI data patterns (tensors, KV cache) rather than treating everything as files is a meaningful architectural advantage.

Looking ahead

The Atlas 950 SuperPoD arrives Q4 2026. The Atlas 960 SuperPoD (15,488 NPUs, 30 EFLOPS FP8) follows in Q4 2027. The million-NPU SuperCluster is on the same timeline.

Whether these systems deliver on their promises remains to be seen. But the architectural ambition — memory fabric interconnect, open protocols, full-stack integration — represents a genuinely different approach to AI infrastructure than what the West is building.

For platform engineers and infrastructure architects, understanding this stack is no longer optional. It is part of the landscape.


Related: openFuyao: Kubernetes Cluster Computing, openEuler: Enterprise Linux, GPU Sharing on Kubernetes: MIG, MPS, Time-Slicing, Multi-Tenant GPU Platform Operating Model.