ANEForge
Apple Neural Engine · direct access · no CoreML

Train and compute on the chip Apple built for inference only.

ANEForge compiles a tensor graph into one program for the Apple Neural Engine and runs it from Python, without CoreML. The same path trains on the engine, optimizer included. The compute unit is fixed, so a model never falls back to the CPU or GPU.

quickstart.py
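import aneforge as af

# W (conv weights), image, and tokens are placeholders you supply.
x   = af.input((1, 3, 32, 32))    # lazy graph input
y   = af.conv(x, W, pad=1).relu().mean((2, 3))   # conv -> ReLU -> mean over axes 2, 3
net = af.compile(y, compress="int8")
out = net(image)                  # runs on the ANE

enc = af.load(".../all-MiniLM-L6-v2")
vec = enc(tokens)                 # cosine 1.0000 vs the float32 reference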
At a glance
ResNet-18, one image: 0.33 ms on the engine
vs GPU (float32): 6.1× faster (GPU: 2.0 ms)
ANE rail, hot loop: 4.5 W
fidelity: cosine 1.0000
operators: 58 fused + 19 bridge
dispatch floor: ~70 µs
Software, paper, guide

The package, the paper behind it, and a guide to the engine.

Software · MIT Licensed

ANEForge

The Python package. Build a graph, compile it to one ANE program, then run or train it. It handles quantized weights, native attention, resident state, and cross-compilation for 28 targets.

GitHub → Docs →
arXiv · preprint

The paper

“Python for direct computation on the Apple Neural Engine.” The preprint behind the numbers on this page, with the dispatch path and the validation against the float32 reference described in full.

arXiv →
Book · web edition

The guide

A reverse-engineering account of the engine: the datapath, the compiler, the program format, the firmware, the kernel driver, and the cross-silicon target tables. Most of it is undocumented by Apple.

Read →
What it does

Training, native layers, compression, and arbitrary workloads.

Apple exposes the Neural Engine only through CoreML, and only for inference. ANEForge dispatches through the same private aned stack that CoreML uses internally, from a normal user process.

Training runs on the engine · CIFAR-10 → 71%

The forward pass, backward pass, and Adam update all compile to ANE programs, so a model trains end to end on the engine.
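
For reference, the Adam update that such a training step has to express is small. The sketch below is plain Python on lists, not ANEForge code: `adam_step` is a hypothetical helper, and its default hyperparameters are the common Adam defaults, assumed here because this page does not say which values ANEForge uses.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats; returns (params, m, v).

    t is the 1-based step count, used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected moments
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy problem: minimise f(x) = x^2 from x = 1.0; the gradient is 2x.
x, m, v = [1.0], [0.0], [0.0]
for t in range(1, 501):
    x, m, v = adam_step(x, [2.0 * x[0]], m, v, t, lr=0.01)
```

The moment buffers `m` and `v` are the optimizer state that has to persist between steps.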

Layers CoreML can’t reach · +19 bridge ops

af.sdpa drives the engine's fused attention directly, something CoreML never emits. Sort, argmax, top-k, and geometry ops are covered too.
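
As a reference for what a fused attention op computes, here is a plain-Python sketch of single-head scaled dot-product attention, softmax(QKᵀ/√d)·V. `sdpa_reference` is a hypothetical helper for checking outputs, not part of ANEForge; the signature of `af.sdpa` is not given on this page, so it is not called here.

```python
import math

def sdpa_reference(q, k, v):
    """Single-head scaled dot-product attention on nested lists.

    q: [n_q][d], k: [n_k][d], v: [n_k][d_v]; returns [n_q][d_v].
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax: subtract the max before exponentiating.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
attended = sdpa_reference(q, k, v)
```

Each output row is a convex combination of the value rows, weighted toward the keys that best match the query.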

Streaming weight compression · 4× smaller

Weights compress to int8, int4-LUT, or sparse form via the dequant path, with each scheme gated on accuracy. KV-cache and optimizer state stay resident.
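
To make the size figure concrete: a float32 weight takes 4 bytes and an int8 weight takes 1, which is consistent with a 4× reduction. The sketch below shows symmetric per-tensor int8 quantization with a cosine-similarity gate in plain Python. It illustrates the general technique only. The helper names and the 0.999 threshold are assumptions, the gate here compares weights rather than model outputs for simplicity, and this page does not describe ANEForge's actual quantizer or gate.

```python
import math

def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns (int8 codes, scale)."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 1.0

def compress_if_accurate(weights, threshold=0.999):
    """Keep the int8 form only if the dequantized weights stay close to the originals."""
    codes, scale = quantize_int8(weights)
    ok = cosine(weights, dequantize(codes, scale)) >= threshold
    return (codes, scale) if ok else None

weights = [0.12, -0.5, 0.33, 0.9, -0.77, 0.05]
result = compress_if_accurate(weights)   # None would mean "keep full precision"
```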

Run arbitrary workloads · FFT, BLAS

Not just neural nets. FFTs, linear algebra, and fluid sims compile straight to the engine.

How it compares

CoreML is the only public route to the engine, and CoreML, not the caller, decides whether a model actually runs on it.

| Path | On the ANE | No CoreML | Trains on it |
| --- | --- | --- | --- |
| CoreML / coremltools | scheduler chooses | no | no |
| MLX, PyTorch (MPS) | no (GPU) | yes | on the GPU |
| ANEForge | yes (direct) | yes | yes |

ANEForge compiles to the engine from an ordinary user process, with no entitlement and no SIP changes.

Measured

ResNet-18 in 0.33 ms on the engine: about 6× faster than the GPU and about 18× faster than the CPU.

| ResNet-18 forward | ANE | GPU | CPU |
| --- | --- | --- | --- |
| end-to-end latency | 0.33 ms | 2.0 ms | 6.0 ms |

M5 Pro, macOS 26.5; the GPU and CPU baselines are PyTorch at float32, and the ANE rail draws 4.5 W during the hot loop. ResNet-18, a ViT-B/16, and a MiniLM encoder each match their float32 reference to cosine 1.0000.
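
The ratios behind these figures are simple to reproduce. The sketch below uses only the numbers quoted on this page. The energy estimate is a back-of-envelope figure that assumes the ANE rail draws a constant 4.5 W for the full 0.33 ms, which is an assumption and not a measurement reported here.

```python
# Figures quoted on this page (M5 Pro, ResNet-18 forward, one image).
latency_ms = {"ANE": 0.33, "GPU": 2.0, "CPU": 6.0}
ane_rail_watts = 4.5

def speedup(baseline, target="ANE"):
    """How many times faster `target` is than `baseline`."""
    return latency_ms[baseline] / latency_ms[target]

gpu_speedup = speedup("GPU")   # about 6.1
cpu_speedup = speedup("CPU")   # about 18.2

# Assumption: constant 4.5 W across the whole inference.
# watts * milliseconds = millijoules.
energy_mj = ane_rail_watts * latency_ms["ANE"]   # about 1.5 mJ per image

print(f"GPU: {gpu_speedup:.1f}x, CPU: {cpu_speedup:.1f}x, ~{energy_mj:.1f} mJ per image")
```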

Demos

Simulations and learning, run and trained on the engine.

Each of these compiles to ANE programs and runs end to end on the Neural Engine.

Get started

Apple Silicon, macOS 14+, Python 3.10+.

$ pip install aneforge

The e5rt dispatch shim builds against Apple frameworks on first use. Then browse examples/, starting with the quickstart.
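
Before installing, the stated requirements can be checked from Python with the standard library alone. This is a minimal sketch: `check_requirements` is a hypothetical helper, not something the package ships, and it treats `arm64` as the marker for Apple Silicon (an interpreter running under Rosetta reports `x86_64` instead).

```python
import platform
import sys

def check_requirements(system, machine, macos_version, python_version):
    """Return a list of unmet requirements (empty means all are met).

    Requirements from this page: Apple Silicon, macOS 14+, Python 3.10+.
    """
    problems = []
    if system != "Darwin":
        problems.append("macOS is required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required")
    major = macos_version.split(".")[0]
    if not major.isdigit() or int(major) < 14:
        problems.append("macOS 14 or newer is required")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    return problems

problems = check_requirements(
    platform.system(),
    platform.machine(),
    platform.mac_ver()[0],   # empty string on non-macOS systems
    sys.version_info,
)
print("ready for pip install aneforge" if not problems else "\n".join(problems))
```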