Describe the issue
We're encountering a severe CPU execution provider (EP) performance issue with ONNX Runtime on Windows/Intel: inference is 4–5× slower than on AMD or ARM hardware.
We profiled Moonshine Voice AI ONNX inference on a Surface Pro 11 (Snapdragon ARM64) and a Surface Pro 11 (Intel x64) to diagnose the 4–5× slower decoder performance on Intel. Both runs use ORT 1.24.2, the same model, and the same audio file. This appears to be a kernel/backend dispatch issue.
The decoder session dominates runtime (autoregressive path).
The following operations are significantly slower on Intel; per-kernel latency is consistently 3–5× worse:

| Op | Snapdragon | Intel | Delta |
| --- | --- | --- | --- |
| DequantizeLinear | ~800–900 µs | ~3000–3500 µs | ~4× slower |
| Large MatMul (vocab projection) | ~1.4 ms | ~2.4–3.0 ms | ~2–3× slower |
| Smaller attention MatMuls | ~100 µs | ~400–500 µs | ~4× slower |
- Intel shows significantly slower DequantizeLinear and quantized matmul kernels.
- Snapdragon shows much better scaling and lower per-op latency.
- Intel sometimes shows extremely slow first matmul events (possible weight packing), but steady-state still remains much slower.
- Total decoder compute time explains the entire 4× end-to-end difference.
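The per-op numbers above come from ORT's built-in profiler (`SessionOptions::EnableProfiling`), which writes a chrome-trace JSON where each kernel event carries a `dur` field in microseconds and an `op_name` under `args`. For anyone wanting to reproduce the aggregation, here is a minimal sketch; it uses naive line-based string scanning rather than a real JSON parser and assumes one event per line, which may not hold for every ORT version:

```cpp
// Minimal sketch: aggregate per-op latencies from an ONNX Runtime profiling
// trace. Field names "dur" (microseconds) and "op_name" follow the
// chrome-trace format ORT emits; treat the one-event-per-line layout as an
// assumption of this sketch.
#include <map>
#include <string>

// Extract the value that follows `key` within one trace-event line.
static std::string extract(const std::string& s, const std::string& key) {
    size_t pos = s.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    // Skip the separator characters between key and value.
    while (pos < s.size() && (s[pos] == ':' || s[pos] == '"' || s[pos] == ' ')) ++pos;
    size_t end = pos;
    while (end < s.size() && s[end] != ',' && s[end] != '"' && s[end] != '}') ++end;
    return s.substr(pos, end - pos);
}

// Sum "dur" per "op_name" across all kernel events in the trace.
std::map<std::string, long long> aggregate(const std::string& trace) {
    std::map<std::string, long long> total_us;
    size_t start = 0;
    while (start < trace.size()) {
        size_t end = trace.find('\n', start);
        if (end == std::string::npos) end = trace.size();
        std::string line = trace.substr(start, end - start);
        std::string op  = extract(line, "\"op_name\"");
        std::string dur = extract(line, "\"dur\"");
        if (!op.empty() && !dur.empty()) total_us[op] += std::stoll(dur);
        start = end + 1;
    }
    return total_us;
}
```

Summing these per-op totals is how we concluded that decoder compute alone accounts for the end-to-end gap.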
It appears that the x64 CPU EP might be:
- not dispatching to the optimal ISA (AVX2 / AVX-VNNI / AMX),
- falling back to a slower quantized MatMul kernel path, or
- using a different oneDNN/MLAS dispatch path or quantized GEMM implementation than the ARM64 build.
Questions:
- Are there any known performance regressions in quantized MatMul or DequantizeLinear on x64?
- Is there a known ISA dispatch condition that could cause AVX2/VNNI to be bypassed?
- Do the ARM64/AMD builds of the CPU EP use a different optimized quantized kernel path than x64?
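To help answer the ISA-dispatch question from our side, we can check what the Intel machine actually reports via CPUID. A hedged sketch follows; the leaf/bit positions are our reading of the Intel SDM (AVX2 in leaf 7/EBX bit 5, AVX512-VNNI in leaf 7/ECX bit 11, AMX-INT8 in leaf 7/EDX bit 25, AVX-VNNI in leaf 7 subleaf 1/EAX bit 4), so please double-check them, and `print_x64_features` is just an illustrative helper, not part of ORT:

```cpp
// Sketch: report which x64 ISA extensions the CPU advertises, to distinguish a
// feature-detection (dispatch) problem from a kernel-quality problem.
#include <iostream>

#if defined(_M_X64) || defined(__x86_64__) || defined(__i386__)
#ifdef _MSC_VER
#include <intrin.h>
static void cpuid(int leaf, int subleaf, int out[4]) { __cpuidex(out, leaf, subleaf); }
#else
#include <cpuid.h>
static void cpuid(int leaf, int subleaf, int out[4]) {
    unsigned a = 0, b = 0, c = 0, d = 0;
    __cpuid_count(leaf, subleaf, a, b, c, d);
    out[0] = (int)a; out[1] = (int)b; out[2] = (int)c; out[3] = (int)d;
}
#endif
#else
// Non-x86 build: report everything as unsupported.
static void cpuid(int, int, int out[4]) { out[0] = out[1] = out[2] = out[3] = 0; }
#endif

// Test whether bit i is set in reg.
static bool bit(unsigned reg, int i) { return (reg >> i) & 1u; }

// Print the ISA features a quantized GEMM backend could dispatch to here.
void print_x64_features() {
    int r[4] = {0, 0, 0, 0};
    cpuid(7, 0, r);
    std::cout << "AVX2:        " << bit((unsigned)r[1], 5)  << "\n"; // EBX bit 5
    std::cout << "AVX512-VNNI: " << bit((unsigned)r[2], 11) << "\n"; // ECX bit 11
    std::cout << "AMX-INT8:    " << bit((unsigned)r[3], 25) << "\n"; // EDX bit 25
    cpuid(7, 1, r);
    std::cout << "AVX-VNNI:    " << bit((unsigned)r[0], 4)  << "\n"; // EAX bit 4
}
```

If the machine reports AVX2/AVX-VNNI but the profiler still shows the slow kernels, that would point at the dispatch condition inside ORT rather than missing hardware support.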
To reproduce
- Enlist in the Moonshine AI project
- Replace the latest ORT include/lib/DLL files under \core\third-party\onnxruntime\include and \core\third-party\onnxruntime\lib\windows\x86_64
- Download the medium-streaming model (details below)
- Build the project using \scripts\run-core-tests.bat
- Run \Core\Build\benchmark.exe --model-path \medium-streaming-en --model-arch 5 --wav-path Recording.wav
Urgency
No response
Platform
Windows
OS Version
Windows 11/Windows 10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.24.2
ONNX Runtime API
C++
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
Moonshine Voice model can be downloaded by following instructions here
Is this a quantized model?
Yes