Inside the NPU: Why Every Major Chip Now Has a Neural Engine — and What It Actually Does | IRCNF - Intelligent Reliable Custom Next-gen Frameworks

A quiet hardware transition has been building for three years, and in 2026 it's essentially complete: nearly every consumer-grade processor shipped by Apple, Qualcomm, Intel, AMD, and MediaTek now includes a dedicated neural processing unit. The NPU is no longer an enthusiast spec. It's the new baseline.

The shift is significant enough that Microsoft's Copilot+ PC program for Windows 11 requires an NPU rated at 40 TOPS (tera-operations per second) or more as a hard gate for certification. In practice, what do these chips do, and why couldn't existing GPU and CPU hardware handle the same workloads?

Why a Separate Chip for AI

The GPU didn't disappear from the AI stack — it remains the dominant compute substrate for training and for large-scale inference in data centers. But GPUs are energy-hungry and optimized for parallelism at scale. A phone or laptop using a mobile GPU for continuous AI inference — background noise cancellation, real-time translation, video enhancement — would drain the battery in a few hours.

NPUs solve this with specialization. Unlike a GPU (which runs general parallel workloads) or a CPU (which excels at sequential, branching logic), an NPU is purpose-built for the matrix multiplications and activation functions that dominate neural network inference. The result is far better energy efficiency, often several times that of a GPU on the same task, for a narrow but growing class of tasks.
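
To make "matrix multiplications and activation functions" concrete, here is a minimal pure-Python sketch of one quantized dense layer followed by ReLU. It assumes a simple per-tensor int8 scheme; the function names, scales, and values are hypothetical and chosen only for illustration, not taken from any vendor's NPU design.

```python
def quantize(values, scale):
    """Map floats to int8 by dividing by a scale and clamping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dense_relu_int8(x_q, w_q, x_scale, w_scale):
    """One dense layer: integer multiply-accumulate, then rescale and apply ReLU.

    x_q is a list of int8 inputs; w_q holds one row of int8 weights per output.
    Returns (float outputs, number of multiply-accumulate operations).
    """
    outputs, macs = [], 0
    for row in w_q:
        acc = 0  # hardware would typically accumulate in a wider integer type
        for xi, wi in zip(x_q, row):
            acc += xi * wi
            macs += 1
        # Dequantize the accumulator, then clamp negatives to zero (ReLU).
        outputs.append(max(0.0, acc * x_scale * w_scale))
    return outputs, macs


SCALE = 0.01  # hypothetical quantization scale for both inputs and weights
x = quantize([0.5, 1.0, 0.25], SCALE)
w = [
    quantize([0.2, 0.1, -0.4], SCALE),
    quantize([-0.3, 0.5, 0.1], SCALE),
    quantize([-0.5, -0.2, 0.1], SCALE),
]
y, macs = dense_relu_int8(x, w, SCALE, SCALE)
print(y, macs)
```

The inner loop is nothing but integer multiply-accumulates, which is the operation an NPU's arrays are built to run in bulk. The third output is negative before ReLU and comes out as zero.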

Apple has been shipping NPUs since the A11 Bionic in 2017, initially marketed as the "Neural Engine" for Face ID. The A11's Neural Engine executed 600 billion operations per second (0.6 TOPS). The A18 Pro in the iPhone 16 Pro, released in 2024, does 35 TOPS: a nearly 60x improvement in seven years, on a chip that still fits in a phone.
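
The arithmetic behind that comparison is short enough to check directly. The sketch below uses the two throughput figures quoted above and takes 2017 (A11) and 2024 (iPhone 16 Pro) as the release years. The per-year figure assumes smooth compound growth, which is a simplification.

```python
a11_tops = 600e9 / 1e12      # 600 billion ops/s expressed in TOPS
a18_pro_tops = 35.0

speedup = a18_pro_tops / a11_tops
years = 2024 - 2017          # A11 release year to iPhone 16 Pro release year
annual_growth = speedup ** (1 / years)  # assumes constant compound growth

print(f"{speedup:.1f}x overall, roughly {annual_growth:.2f}x per year")
```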

The Current Landscape by Platform

Qualcomm's Snapdragon X Elite, the chip powering most of the Copilot+ Windows laptops released in 2024–2025, delivers 45 TOPS through its Hexagon NPU. Qualcomm claims 4.5x better efficiency per watt than comparable GPU inference on the same tasks — a figure that holds up reasonably well in independent testing.

Apple's M4 Pro pushes 38 TOPS from its Neural Engine, with Apple reporting substantial gains on Core ML benchmarks over the M3 generation. The M-series chips benefit from unified memory architecture — the Neural Engine shares the same high-bandwidth memory pool as the CPU and GPU, eliminating the copy overhead that hobbles discrete GPU inference on small models.

Intel's Core Ultra 200V series (Lunar Lake) marks Intel's most competitive NPU to date at 48 TOPS, designed to clear the Copilot+ threshold with headroom for future Windows AI requirements. AMD's Ryzen AI 300 series reaches 50 TOPS. MediaTek's Dimensity 9400, which powers flagship Android phones such as the Vivo X200 and Oppo Find X8 series, achieves 50 TOPS with significant efficiency gains over the previous generation.

What NPUs Are Actually Running

The use cases fall into consistent categories:

Continuous, latency-sensitive tasks. Real-time transcription and captioning (Live Captions on Apple devices and on Windows), background blur and voice cleanup in video calls (Windows Studio Effects), and active noise cancellation are tasks where running on the GPU costs too much power and cloud round-trips introduce unacceptable lag. NPUs handle these continuously with minimal power draw.

On-device LLM inference. Models in the 1B–8B parameter range, such as Phi-3 Mini, Gemma 3 4B, and Llama 3.2 3B, can run entirely on-device via the NPU when quantized to 4-bit precision. Apple's Private Cloud Compute architecture handles only the requests too large for on-device models. On Windows, Copilot+ PCs ship with Phi Silica, Microsoft's NPU-tuned small language model from the Phi family, for on-device responses.
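
A rough way to see why 4-bit quantization matters for this size range is to estimate weight storage alone. The sketch below uses nominal parameter counts (1B, 3B, 4B, 8B) rather than exact figures for any named model, and it ignores the KV cache, activations, and quantization metadata, so real footprints will be somewhat larger.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts weights only; KV cache, activations, and per-block
    quantization scales are deliberately left out.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


# Nominal model sizes, in billions of parameters.
for label, params in [("1B", 1.0), ("3B", 3.0), ("4B", 4.0), ("8B", 8.0)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{label}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

At 16-bit precision a nominal 8B model needs about 16 GB for weights, more than many phones and thin laptops can spare. At 4-bit the same weights need about 4 GB.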

Computational photography. Real-time HDR fusion, semantic segmentation for background replacement, and face mesh tracking for AR are NPU workloads on every current flagship phone. Over the past three years, a growing share of the camera processing pipeline has moved from fixed-function ISP stages to learned models running on the NPU.

Search and retrieval indexing. Windows Recall uses the NPU to continuously process screenshots and create a searchable semantic index. Apple's on-device Photos search uses the Neural Engine for image embedding and similarity matching.

The Benchmark Problem

TOPS is a deceptive metric. It measures peak throughput under ideal conditions — sustained matrix multiplication with all execution units firing. Real AI workloads are spikier and more irregular. A 50-TOPS NPU running a poorly optimized model may underperform a 35-TOPS chip with better compiler support and memory architecture.
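
One common way to reason about this gap is the roofline model, which caps throughput at the lower of peak compute and what memory bandwidth can feed. The sketch below applies it to two hypothetical chips. The bandwidth figures and arithmetic intensities are made-up illustrations, not measurements of any real NPU, and the model covers only the memory side of the story, not compiler quality.

```python
def attainable_tops(peak_tops, mem_bandwidth_gbs, ops_per_byte):
    """Roofline estimate: min(peak compute, bandwidth * arithmetic intensity)."""
    memory_bound_tops = mem_bandwidth_gbs * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)


# Hypothetical chips: A has the bigger TOPS number, B has more memory bandwidth.
chip_a = {"peak_tops": 50.0, "mem_bandwidth_gbs": 60.0}
chip_b = {"peak_tops": 35.0, "mem_bandwidth_gbs": 120.0}

# ops_per_byte: how many operations the workload performs per byte it moves.
for intensity in (50, 200, 1000):
    a = attainable_tops(ops_per_byte=intensity, **chip_a)
    b = attainable_tops(ops_per_byte=intensity, **chip_b)
    print(f"{intensity:>5} ops/byte: chip A {a:.1f} TOPS, chip B {b:.1f} TOPS")
```

Under these assumptions the 50-TOPS chip only reaches its headline number at very high arithmetic intensity. At lower intensities the 35-TOPS chip with twice the bandwidth comes out ahead.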

The emerging standard for practical NPU benchmarking is MLPerf Mobile, which measures end-to-end performance on standardized models rather than raw TOPS. The gap between paper specs and MLPerf results can be wide. Some high-TOPS chips underperform significantly on tasks that weren't central to their design.
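
The idea of measuring end-to-end performance rather than peak throughput can be sketched with a small timing harness. This is not MLPerf Mobile's methodology. It is a generic illustration, and `benchmark` and `fake_model` are hypothetical names, with `fake_model` standing in for a real inference call.

```python
import statistics
import time


def benchmark(run_inference, sample, warmup=3, iterations=20):
    """Time whole inference calls, including any pre/post-processing inside them."""
    for _ in range(warmup):
        run_inference(sample)  # warm caches before measuring

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    p90_index = min(len(latencies_ms) - 1, int(0.9 * len(latencies_ms)))
    median = statistics.median(latencies_ms)
    return {
        "median_ms": median,
        "p90_ms": latencies_ms[p90_index],
        "throughput_per_s": 1000.0 / median if median > 0 else float("inf"),
    }


def fake_model(sample):
    """Hypothetical stand-in for a real model call."""
    return sum(v * v for v in sample)


result = benchmark(fake_model, list(range(10_000)))
print(result)
```

Reporting median and tail latency for a complete call captures scheduling, memory transfers, and unsupported operators falling back to the CPU, none of which show up in a TOPS figure.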

What This Means for Developers

The existence of widely deployed NPUs is creating a new tier in the AI deployment stack. The current split: cloud inference for large models (GPT-4, Claude 3.7+, Gemini 2.5), on-device NPU inference for models up to ~8B parameters at 4-bit quantization, and a growing middle tier of server-class edge inference for 13B–70B models.

For developers building AI-powered features, the practical question is now which inference tier fits the use case, not just whether cloud inference is available. Tasks with strict privacy, low-latency, or offline requirements should target on-device inference via Core ML, Windows ML, or, on Android, LiteRT (formerly TensorFlow Lite), which supersedes the NNAPI deprecated in Android 15. The frameworks are maturing. The hardware is there.
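
The tier decision can be sketched as a small function. The thresholds below come from the split described above (about 8B parameters on-device at 4-bit, up to 70B for edge servers). The function name, flags, and return strings are hypothetical, and a real decision would also weigh cost, accuracy, and device capability.

```python
ON_DEVICE_MAX_B = 8   # "~8B parameters at 4-bit" on-device ceiling
EDGE_MAX_B = 70       # upper end of the edge-server middle tier


def choose_tier(params_billions, offline=False, private=False, low_latency=False):
    """Pick an inference tier from model size and deployment constraints.

    A simplification: models that fit on-device are always placed there,
    and models between 8B and 13B are treated as edge workloads.
    """
    must_stay_local = offline or private or low_latency
    if params_billions <= ON_DEVICE_MAX_B:
        return "on-device NPU"
    if must_stay_local:
        return "shrink the model: too large for on-device inference"
    if params_billions <= EDGE_MAX_B:
        return "edge server"
    return "cloud"


print(choose_tier(3, private=True))
print(choose_tier(34))
print(choose_tier(400))
print(choose_tier(70, offline=True))
```

The fourth call shows the uncomfortable case: a hard offline requirement combined with a model that does not fit, where the only way forward is a smaller or more aggressively quantized model.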

The NPU race isn't slowing. Qualcomm's next-generation Snapdragon platform is expected to push past 70 TOPS. Apple's A19 Pro family is targeting 45+ TOPS. The question is no longer whether your device has an AI chip — it's which parts of your workload you've moved to it.