Apple Neural Engine · direct access · no CoreML

Train and compute on the chip Apple built for inference only.

ANEForge compiles a tensor graph into one program for the Apple Neural Engine and runs it from Python, without CoreML. The same path trains on the engine, optimizer included. The compute unit is fixed, so a model never falls back to the CPU or GPU.

Get ANEForge → Read the paper → Read the guide →

quickstart.py

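import numpy as np
import aneforge as af

# Placeholder data so the script is self-contained. The shapes, the
# (out, in, kH, kW) weight layout, and NumPy inputs are assumptions.
W     = np.random.randn(16, 3, 3, 3).astype(np.float32)
image = np.random.randn(1, 3, 32, 32).astype(np.float32)

x   = af.input((1, 3, 32, 32))    # lazy graph input
y   = af.conv(x, W, pad=1).relu().mean((2, 3))
net = af.compile(y, compress="int8")
out = net(image)                  # runs on the ANE

enc = af.load(".../all-MiniLM-L6-v2")
# tokens: tokenized input, prepared elsewhere
vec = enc(tokens)                 # cosine 1.0000 vs the float32 reference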

At a glance

ResNet-18, one image: 0.33 ms on the engine

vs GPU (float32): 6.1× faster (GPU: 2.0 ms)

ANE rail, hot loop: 4.5 W

fidelity: cosine 1.0000

operators: 58 fused + 19 bridge

dispatch floor: ~70 µs

Software, paper, guide

The package, the paper behind it, and a guide to the engine.

Software · MIT Licensed

ANEForge

The Python package. Build a graph, compile it to one ANE program, then run or train it. It handles quantized weights, native attention, resident state, and cross-compilation for 28 targets.

GitHub → Docs →

arXiv · preprint

The paper

“Python for direct computation on the Apple Neural Engine.” The preprint behind the numbers on this page, with the dispatch path and the validation against reference in full.

arXiv →

Book · web edition

The guide

A reverse-engineering account of the engine: the datapath, the compiler, the program format, the firmware, the kernel driver, and the cross-silicon target tables. Most of it is undocumented by Apple.

Read →

What it does

Training, native layers, compression, and arbitrary workloads.

Apple exposes the Neural Engine only through CoreML, and only for inference. ANEForge dispatches through the same private aned stack that CoreML uses internally, from a normal user process.

Training runs on the engine · CIFAR-10 → 71%

The forward pass, backward pass, and Adam update all compile to ANE programs, so a model trains end to end on the engine.
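For readers unfamiliar with the update step named above, here is a minimal NumPy sketch of one standard Adam step. It is a CPU reference for the arithmetic only, not ANEForge code; the function name `adam_step` and the default hyperparameters are assumptions for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy parameters and gradient (made-up values).
w = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -0.1, 2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w_new, m, v = adam_step(w, g, m, v, t=1)
```

On the first step the bias-corrected update reduces to roughly lr times the sign of the gradient, which makes a handy sanity check when comparing any on-engine result against a CPU reference.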

Layers CoreML can’t reach · +19 bridge ops

af.sdpa drives fused attention directly, which CoreML never emits. Sort, argmax, topk, and geometry ops are covered too.
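As a reminder of what fused attention computes, here is a textbook NumPy reference for scaled dot-product attention. The signature of `af.sdpa` is not shown on this page, so nothing below is ANEForge API; `sdpa_reference` is a hypothetical helper, and masking is left out.

```python
import numpy as np

def sdpa_reference(q, k, v):
    """softmax(q @ k^T / sqrt(d)) @ v over the last two axes."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))   # (heads, tokens, head_dim), made-up sizes
k = rng.standard_normal((2, 4, 8))
v = rng.standard_normal((2, 4, 8))
out = sdpa_reference(q, k, v)
```

A reference like this is mainly useful for checking a fused kernel's output: with an all-zero query, every token attends uniformly and the result is the mean of the value rows.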

Streaming weight compression · 4× smaller

int8, int4-LUT, or sparse weights via the dequant path, accuracy-gated. KV-cache and optimizer state stay resident.
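To make the int8 case concrete, here is a minimal NumPy sketch of symmetric per-tensor quantization and dequantization. The scheme ANEForge actually uses is not described on this page, so treat this as one common approach; `quantize_int8` and `dequantize_int8` are hypothetical names.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ~= scale * q, q in [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Expand int8 codes back to float32."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 3, 3)).astype(np.float32)  # made-up weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

ratio = w.nbytes / q.nbytes                 # float32 -> int8 storage ratio
max_err = float(np.abs(w - w_hat).max())    # bounded by about scale / 2
```

Going from 4 bytes to 1 byte per weight gives the 4× storage ratio, and the rounding error per weight stays within about half a quantization step. An accuracy gate would compare model outputs before and after this round trip.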

Run arbitrary workloads · FFT, BLAS

Not just neural nets. FFTs, linear algebra, and fluid sims compile straight to the engine.

How it compares

CoreML is the only public route to the engine, and with CoreML the scheduler, not the caller, decides whether a model actually runs there.

Path	On the ANE	No CoreML	Trains on it
CoreML / coremltools	scheduler chooses	no	no
MLX, PyTorch (MPS)	no (GPU)	yes	on the GPU
ANEForge	yes (direct)	yes	yes

ANEForge compiles to the engine from an ordinary user process, with no entitlement and no SIP changes.

Measured

ResNet-18 runs in 0.33 ms on the engine, about 6× faster than the GPU and about 18× faster than the CPU.

ResNet-18 forward	ANE	GPU	CPU
end-to-end latency	0.33 ms	2.0 ms	6.0 ms

M5 Pro, macOS 26.5; the GPU and CPU baselines are PyTorch at float32, and the ANE rail draws 4.5 W during the hot loop. ResNet-18, a ViT-B/16, and a MiniLM encoder each match their float32 reference to cosine 1.0000.
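The fidelity figure is a cosine similarity between an output and its float32 reference. A minimal sketch of that metric in NumPy follows; the vectors are made-up stand-ins, not measurements from the paper, and `cosine_similarity` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened arrays."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = np.array([0.2, -1.3, 0.7, 4.0])                       # stand-in float32 output
candidate = reference + 1e-4 * np.array([1.0, -1.0, 1.0, -1.0])   # slightly perturbed copy
score = cosine_similarity(reference, candidate)
print(f"{score:.4f}")
```

Cosine similarity ignores overall scale, so it is usually paired with an absolute or relative error check when validating outputs.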

Demos

Simulations and learning, run and trained on the engine.

Each of these compiles to ANE programs and runs end to end on the Neural Engine.

Fluid simulation

[Image: a passive dye shaped as the word ANEForge, stirred into glowing filaments by a fluid simulation on the Apple Neural Engine]

A dye painted as the word ANEForge, stirred by a 2-D incompressible Navier–Stokes flow. Every FFT in the pseudo-spectral loop runs on the engine, about 54 J at the 1.48 W rail.
reproduce →
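The energy and power figures for the fluid demo imply a run time, since energy is power times time. A quick check, assuming the 1.48 W rail power is roughly constant over the run (the page does not state the duration):

```python
energy_j = 54.0       # from the text: about 54 J
rail_power_w = 1.48   # from the text: 1.48 W rail

# Energy = power x time, assuming constant rail power over the whole run.
runtime_s = energy_j / rail_power_w
print(f"{runtime_s:.1f} s")   # prints "36.5 s"
```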

Reaction-diffusion

[Image: a Gray-Scott reaction-diffusion system grown from the word ANEForge into a branching labyrinth on the Apple Neural Engine]

Gray–Scott Turing patterns bloom from the word into a branching labyrinth. One program re-dispatches each step: a 3×3 Laplacian as a native ANE conv, the reactions as elementwise ops.
reproduce →
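The step described for the reaction-diffusion demo, a 3×3 Laplacian plus elementwise reactions, can be sketched on the CPU in NumPy. This is a reference for the math only, not the ANEForge program: the 5-point stencil, periodic boundaries, explicit Euler step, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

# 3x3 Laplacian stencil (5-point form); the kernel the demo uses is not given.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def laplacian(field):
    """Apply the 3x3 stencil with periodic boundaries."""
    out = np.zeros_like(field)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            weight = LAPLACIAN[dy + 1, dx + 1]
            if weight != 0.0:
                out += weight * np.roll(field, shift=(dy, dx), axis=(0, 1))
    return out

def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.060, dt=1.0):
    """One explicit Euler step of the Gray-Scott equations."""
    uvv = u * v * v                                   # reaction term, elementwise
    u_next = u + dt * (du * laplacian(u) - uvv + feed * (1.0 - u))
    v_next = v + dt * (dv * laplacian(v) + uvv - (feed + kill) * v)
    return u_next, v_next

# Start from the uniform state and seed a small square of the second species.
u = np.ones((64, 64))
v = np.zeros((64, 64))
u[28:36, 28:36] = 0.5
v[28:36, 28:36] = 0.5
for _ in range(100):
    u, v = gray_scott_step(u, v)
```

The split mirrors the description above: the stencil is a fixed-weight convolution, and everything else is multiply-add on whole arrays, which is why the whole step fits in one re-dispatched program.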

A network that grows

[Image: a neural cellular automaton, trained on the Apple Neural Engine, grows a lizard from a single seed pixel]

A small CNN update rule, trained on the engine, grows a lizard from one seed pixel. The forward and backward passes both run on the ANE, so the rule is learned there, not just replayed.
reproduce →

Get started

Apple Silicon, macOS 14+, Python 3.10+.

$ pip install aneforge

The e5rt dispatch shim builds against Apple frameworks on first use. Then browse examples/, starting with the quickstart.