
[Performance] Severe CPU Performance Regression on Intel vs Snapdragon/ARM (ORT 1.24.2) #27513

@HR-GavinJ

Description


Describe the issue

We're encountering a severe CPU EP performance issue with ORT on Windows Intel x64, which is 4–5× slower compared to AMD/ARM.

We profiled Moonshine Voice AI ONNX inference on a Surface Pro 11 (Snapdragon ARM64) versus a Surface Pro 11 (Intel x64) to diagnose the 4–5× slower decoder performance on Intel. Both tests use ORT 1.24.2, the same model, and the same audio file. This appears to be a kernel/backend dispatch issue.

The decoder session dominates runtime (autoregressive path).

The following operations are significantly slower on Intel; per-kernel latency is consistently 3–5× worse:

Op                               Snapdragon     Intel           Delta
DequantizeLinear                 ~800–900 µs    ~3000–3500 µs   ~4× slower
Large MatMul (vocab projection)  ~1.4 ms        ~2.4–3.0 ms     ~2–3× slower
Smaller attention MatMuls        ~100 µs        ~400–500 µs     ~4× slower
  • Intel shows significantly slower DequantizeLinear and quantized matmul kernels.
  • Snapdragon shows much better scaling and lower per-op latency.
  • Intel sometimes shows extremely slow first matmul events (possible weight packing), but steady-state still remains much slower.
  • Total decoder compute time explains the entire 4× end-to-end difference.

It appears that the Intel ONNX CPU provider might be:

  • Not dispatching to the optimal ISA (AVX2 / AVX-VNNI / AMX)
  • Falling back to a slower quantized MatMul kernel path
  • Hitting a oneDNN / MLAS dispatch difference between the ARM64 and x64 builds
  • Using a different quantized GEMM implementation

Questions:

  • Are there any known performance regressions in quantized MatMul or DequantizeLinear on x64?
  • Is there a known ISA dispatch condition that could cause AVX2/VNNI to be bypassed?
  • Does the ARM64/AMD CPU EP use a different optimized quantized kernel path than x64?

To reproduce

  • Enlist in Moonshine AI
  • Replace the ORT Include/Lib/DLL files with the latest in \core\third-party\onnxruntime\include and \core\third-party\onnxruntime\lib\windows\x86_64
  • Download the medium-streaming model, details below
  • Build project using \scripts\run-core-tests.bat

Execute \Core\Build\benchmark.exe --model-path \medium-streaming-en --model-arch 5 --wav-path Recording.wav

Recording.wav

Urgency

No response

Platform

Windows

OS Version

Windows 11/Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.24.2

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

Moonshine Voice model can be downloaded by following instructions here

Is this a quantized model?

Yes

Labels

ep:oneDNN (questions/issues related to DNNL EP), performance (issues related to performance regressions)
