
[Performance] Severe CPU Performance Regression on Intel vs Snapdragon/ARM (ORT 1.24.2) #27513

@HR-GavinJ

Description


Describe the issue

We're encountering a severe CPU EP performance issue with ORT on Windows Intel x64, which is 4–5× slower compared to AMD/ARM.

We profiled Moonshine Voice AI ONNX inference on a Surface Pro 11 (Snapdragon ARM64) versus a Surface Pro 11 (Intel x64) to diagnose the 4–5× slower decoder performance on Intel. Both tests use ORT 1.24.2, the same model, and the same audio file. This appears to be a kernel/backend dispatch issue.

The decoder session dominates runtime (autoregressive path).

The following operations are significantly slower on Intel; per-kernel latency is consistently 3–5× worse:

Op                               Snapdragon     Intel           Delta
DequantizeLinear                 ~800–900 µs    ~3000–3500 µs   ~4× slower
Large MatMul (vocab projection)  ~1.4 ms        ~2.4–3.0 ms     ~2–3× slower
Smaller attention MatMuls        ~100 µs        ~400–500 µs     ~4× slower
  • Intel shows significantly slower DequantizeLinear and quantized matmul kernels.
  • Snapdragon shows much better scaling and lower per-op latency.
  • Intel sometimes shows extremely slow first matmul events (possible weight packing), but steady-state still remains much slower.
  • Total decoder compute time explains the entire 4× end-to-end difference.

It appears that the Intel ONNX CPU provider might be:

  • Not dispatching to the optimal ISA (AVX2 / AVX-VNNI / AMX)
  • Falling back to a slower quantized MatMul kernel path
  • Hitting a oneDNN / MLAS dispatch difference between the ARM64 and x64 builds
  • Using a different quantized GEMM implementation

Questions:

  • Are there any known performance regressions in quantized MatMul or DequantizeLinear on x64?
  • Is there a known ISA dispatch condition that could cause AVX2/VNNI to be bypassed?
  • Does the ARM64/AMD CPU EP use a different optimized quantized kernel path than x64?

To reproduce

  • Enlist in Moonshine AI
  • Replace the ORT Include/Lib/DLL files with the latest in \core\third-party\onnxruntime\include and \core\third-party\onnxruntime\lib\windows\x86_64
  • Download the medium-streaming model, details below
  • Build project using \scripts\run-core-tests.bat

Execute \Core\Build\benchmark.exe --model-path \medium-streaming-en --model-arch 5 --wav-path Recording.wav

Recording.wav

Urgency

No response

Platform

Windows

OS Version

Windows 11/Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.24.2

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

Moonshine Voice model can be downloaded by following instructions here

Is this a quantized model?

Yes

Labels

ep:oneDNN (questions/issues related to DNNL EP), performance (issues related to performance regressions)
