Train and compute on the chip Apple built for inference only.
ANEForge compiles a tensor graph into one program for the Apple Neural Engine and runs it from Python, without CoreML. The same path trains on the engine, optimizer included. The compute unit is fixed, so a model never falls back to the CPU or GPU.
```python
import aneforge as af

x = af.input((1, 3, 32, 32))                  # lazy graph input
y = af.conv(x, W, pad=1).relu().mean((2, 3))
net = af.compile(y, compress="int8")
out = net(image)                              # runs on the ANE

enc = af.load(".../all-MiniLM-L6-v2")
vec = enc(tokens)                             # cosine 1.0000
```
The package, the paper behind it, and a guide to the engine.
ANEForge
The Python package. Build a graph, compile it to one ANE program, then run or train it. It handles quantized weights, native attention, resident state, and cross-compilation for 28 targets.
GitHub → Docs →
The paper
“Python for direct computation on the Apple Neural Engine.” The preprint behind the numbers on this page, with the dispatch path and the validation against reference in full.
arXiv →
The guide
A reverse-engineering account of the engine: the datapath, the compiler, the program format, the firmware, the kernel driver, and the cross-silicon target tables. Most of it is undocumented by Apple.
Read →
Training, native layers, compression, and arbitrary workloads.
Apple exposes the Neural Engine only through CoreML, and only for inference. ANEForge dispatches through the same private aned stack that CoreML uses internally, from a normal user process.
Training runs on the engine CIFAR-10 → 71%
The forward pass, backward pass, and Adam update all compile to ANE programs, so a model trains end to end on the engine.
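For reference, the Adam update named above is the standard one. The sketch below is plain Python, not ANEForge code: `adam_step` and its defaults are illustrative names and values, and how ANEForge lays the update out as an ANE program is not shown here.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats; returns (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction, t starts at 1
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy problem (hypothetical): minimise f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w, m, v = [0.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2.0 * (w[0] - 3.0)]
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

The two moment lists `m` and `v` are the optimizer state that has to persist between steps.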
Layers CoreML can’t reach +19 bridge ops
af.sdpa drives fused attention directly, which CoreML never emits. Sort, argmax, topk, and geometry ops too.
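Scaled dot-product attention is softmax(QKᵀ/√d)·V. The sketch below is a plain-Python reference for that formula, the kind of thing one might check an engine result against. It does not call `af.sdpa`, whose signature is not given here, and the function names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa_reference(q, k, v):
    """softmax(q k^T / sqrt(d)) v, with q, k, v given as lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        # One score per key: scaled dot product of this query with each key row.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Output row is the attention-weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

# A zero query scores every key equally, so the output is the mean of the values.
result = sdpa_reference([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```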
Streaming weight compression 4× smaller
int8, int4-LUT, or sparse weights via the dequant path, accuracy-gated. KV-cache and optimizer state stay resident.
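As a rough picture of int8 weights and a dequant step, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The scheme, the function names, and the sample weights are assumptions for illustration; ANEForge's actual encoding and accuracy gate are not described here.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one float scale, integer codes in [-127, 127]."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.50, -1.00, 0.25, 0.0, 0.731]   # hypothetical weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each code fits in one byte where a float32 weight takes four, plus one scale per tensor.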
Run arbitrary workloads FFT, BLAS
Not just neural nets. FFTs, linear algebra, and fluid sims compile straight to the engine.
CoreML is the only public route to the engine, and CoreML, not the caller, decides whether a model actually runs there.
| Path | On the ANE | No CoreML | Trains on it |
|---|---|---|---|
| CoreML / coremltools | scheduler chooses | no | no |
| MLX, PyTorch (MPS) | no (GPU) | yes | on the GPU |
| ANEForge | yes (direct) | yes | yes |
ANEForge compiles to the engine from an ordinary user process, with no entitlement and no SIP changes.
ResNet-18 in 0.33 ms on the engine: about 6× faster than the GPU and about 18× faster than the CPU.
| ResNet-18 forward | ANE | GPU | CPU |
|---|---|---|---|
| end-to-end latency | 0.33 ms | 2.0 ms | 6.0 ms |
M5 Pro, macOS 26.5; the GPU and CPU baselines are PyTorch at float32, and the ANE rail draws 4.5 W during the hot loop. ResNet-18, a ViT-B/16, and a MiniLM encoder each match their float32 reference to cosine 1.0000.
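The cosine figure compares an output vector with its float32 reference. A minimal sketch of that comparison, with made-up vectors standing in for real model outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

reference = [1.0, 2.0, 3.0]          # hypothetical float32 reference output
candidate = [1.001, 1.999, 3.0005]   # hypothetical engine output
score = cosine(reference, candidate)
label = f"{score:.4f}"               # four decimals, the format used on this page
```

At four decimals, a reading of 1.0000 means the similarity is at least 0.99995. Cosine ignores overall scale, so it measures agreement in direction rather than magnitude.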
Simulations and learning, run and trained on the engine.
Not just neural nets. Each of these compiles to ANE programs and runs end to end on the Neural Engine.
Fluid simulation
A dye painted as the word ANEForge, stirred by a 2-D incompressible Navier–Stokes flow. Every FFT in the pseudo-spectral loop runs on the engine, about 54 J at the 1.48 W rail.
reproduce →
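In a pseudo-spectral solver, derivatives are taken in Fourier space: transform, multiply each mode by i·k, transform back. The 1-D sketch below shows that single step in plain Python. It uses a naive O(n²) DFT in place of an FFT to stay dependency-free, and none of it is ANEForge code.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform; the inverse divides by n."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def spectral_derivative(samples):
    """d/dx of a 2*pi-periodic function sampled at n evenly spaced points."""
    n = len(samples)
    spectrum = dft(samples)
    # Integer wavenumbers: 0..n/2-1, then the negative frequencies.
    ks = [k if k < n / 2 else k - n for k in range(n)]
    if n % 2 == 0:
        ks[n // 2] = 0   # drop the Nyquist mode for an odd-order derivative
    deriv = [1j * k * c for k, c in zip(ks, spectrum)]
    return [c.real for c in dft(deriv, inverse=True)]

n = 16
xs = [2 * math.pi * j / n for j in range(n)]
d_sin = spectral_derivative([math.sin(x) for x in xs])   # should track cos(x)
```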
Reaction-diffusion
Gray–Scott Turing patterns bloom from the word into a branching labyrinth. One program re-dispatches each step: a 3×3 Laplacian as a native ANE conv, the reactions as elementwise ops.
reproduce →
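One Gray–Scott step, written out in plain Python to show the two pieces named above: a 3×3 Laplacian stencil and elementwise reaction terms. The stencil weights, the parameter values, and the periodic boundary are assumptions chosen for illustration, not values taken from the ANEForge example.

```python
# 9-point Laplacian stencil (an assumed, commonly used choice; weights sum to zero).
KERNEL = [[0.05, 0.20, 0.05],
          [0.20, -1.00, 0.20],
          [0.05, 0.20, 0.05]]

def laplacian(grid):
    """Apply the 3x3 stencil with periodic (wrap-around) boundaries."""
    rows, cols = len(grid), len(grid[0])
    return [[sum(KERNEL[di + 1][dj + 1] * grid[(i + di) % rows][(j + dj) % cols]
                 for di in (-1, 0, 1) for dj in (-1, 0, 1))
             for j in range(cols)]
            for i in range(rows)]

def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """Advance both chemical fields by one explicit Euler step."""
    lap_u, lap_v = laplacian(u), laplacian(v)
    rows, cols = len(u), len(u[0])
    new_u = [[0.0] * cols for _ in range(rows)]
    new_v = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            uvv = u[i][j] * v[i][j] ** 2   # reaction term: u + 2v -> 3v
            new_u[i][j] = u[i][j] + dt * (du * lap_u[i][j] - uvv + feed * (1.0 - u[i][j]))
            new_v[i][j] = v[i][j] + dt * (dv * lap_v[i][j] + uvv - (feed + kill) * v[i][j])
    return new_u, new_v

# Seed one cell of v in an otherwise uniform field and take a single step.
u = [[1.0] * 8 for _ in range(8)]
v = [[0.0] * 8 for _ in range(8)]
v[4][4] = 0.5
u, v = gray_scott_step(u, v)
```

The Laplacian is a fixed-weight convolution and everything else is per-cell arithmetic, which is why each step maps onto a conv plus elementwise ops.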
A network that grows
A small CNN update rule, trained on the engine, grows a lizard from one seed pixel. The forward and backward passes both run on the ANE, so the rule is learned there, not just replayed.
reproduce →
Apple Silicon, macOS 14+, Python 3.10+.
The e5rt dispatch shim builds against Apple frameworks on first use. Then browse examples/, starting with the quickstart.