[OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune by gcunhase · Pull Request #951 · NVIDIA/Model-Optimizer

gcunhase · 2026-03-02T18:15:14Z

What does this PR do?

Type of change: New feature

Overview: ONNX Autotune (also called Auto Q/DQ) is currently and standalone feature of ModelOpt that automatically adds Q/DQ where relevant according to information obtained from TensorRT inference. One issue is that the scales in those Q/DQ nodes are random.

This PR does 2 major things:

Integrates Auto Q/DQ into the ONNX quantization workflow; and
Enables calibration data to be used to obtain the correct scales for the Q/DQ nodes.

Usage

$ python -m modelopt.onnx.quantization --onnx_path=model.onnx --autotune={quick,default,extensive}

Please see __main__.py for other args.

Testing

Added unittest for Q/DQ node placement validation: tests/gpu/onnx/quantization/test_autotune_quantization_integration.py
Verified that accuracy was recovered by integrating MOQ with Autotune. Results on RTX 3090 with TRT 10.12.0.36 (--stronglyTyped) with ViT, as per examples/onnx_ptq:

Model	Top-1 acc	Top-5 acc
FP32	85.1%	97.5%
FP16 (FP32 with --fp16)	85.1%	97.5%
Quant (MOQ)	82.4%	96.4%
Quant (Autotune)	0.1%	0.5%
Quant (MOQ + Autotune)	79.6%	95.0%

Notice that accuracy was mostly recovered from standalone Autotune to MOQ + Autotune (real Q/DQ scales). The drop in accuracy between MOQ and MOQ + Autotune is likely due to some sensitive nodes being quantized, such as BiasAdd.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed.
Is this change backward compatible?: Yes
Did you write any new necessary tests?: Yes
Did you add or update any necessary documentation?: No (will be done in a different PR)
Did you update Changelog?: No

Summary by CodeRabbit

New Features
- Autotuning added to ONNX quantization: CLI flags, presets, per-region tuning, and FP8/INT8 support; accepts in-memory models and optional output dirs; node-filter loading and explicit-flag CLI behavior.
- Activation-operation accessor exposed and autotune helpers added to the package API.
Bug Fixes
- Safer graph rewiring to avoid corrupting quantized graphs when targets are absent.
Tests
- New integration test and model helper validating autotune quantization consistency.

Additional information

To reproduce accuracy with ViT, call download_example_onnx.py and image_prep.py without --fp16.

If --fp16 is used here, quantizing this model with --autotune results in the following error:

[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type

This is fixed in #978.

coderabbitai · 2026-03-02T18:15:31Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds ONNX quantization autotune: CLI presets and flags, autotune workflows and helpers, integration of autotune results into FP8/INT8 quantizers and ORT configuration, defensive graph rewiring, an activation-op helper, and tests plus test-model helpers.

Changes

Cohort / File(s)	Summary
Activation Operations Taxonomy `modelopt/onnx/op_types.py`	Added `get_activation_ops()` returning a set of activation op names.
CLI / Entrypoint `modelopt/onnx/quantization/__main__.py`	Added `--autotune` CLI, presets, `apply_mode_presets()` and propagation of autotune-related args into `quantize` invocation.
Autotune public API & CLI helpers `modelopt/onnx/quantization/autotune/__init__.py`, `modelopt/onnx/quantization/autotune/__main__.py`	Exported `MODE_PRESETS`, `StoreWithExplicitFlag`, `get_node_filter_list`; added `get_node_filter_list()`; renamed action class to `StoreWithExplicitFlag` and applied across parser args.
Autotune workflows `modelopt/onnx/quantization/autotune/workflows.py`	`region_pattern_autotuning_workflow()` accepts `onnx.ModelProto` or path, optional `output_dir` (creates temp dir when None) and adds `default_dq_dtype` param; ensures temp-dir cleanup.
Autotuner base & insertion-point APIs `modelopt/onnx/quantization/autotune/autotuner_base.py`	Added imports and methods `get_resolved_insertion_points()` and `get_ort_quantization_config()`; `export_onnx` delegates insertion-point resolution to these helpers.
Quantize integration & autotune helper `modelopt/onnx/quantization/quantize.py`	Added `_find_nodes_to_quantize_autotune()` and extended `quantize()` signature to accept and merge many autotune parameters/results into quantization flow.
FP8 quantizer `modelopt/onnx/quantization/fp8.py`	Added `autotune: bool = False` param; when `autotune` True, skip several pattern-detection/expansion steps and propagate `op_types_needing_output_quant`.
INT8 quantizer `modelopt/onnx/quantization/int8.py`	Added `autotune: bool = False` param; gate automatic exclusions and node-expansion when `autotune` True (preserve non-autotune behavior otherwise).
ORT utilities `modelopt/onnx/quantization/ort_utils.py`	Added `op_types_needing_output_quant` parameter to `configure_ort()` and integrated it into output-quantization exclusion logic and TRT guided options.
Graph utilities `modelopt/onnx/quantization/graph_utils.py`	Hardened `remove_partial_input_qdq()` to locate target node inputs by DQ output name and skip rewiring when target or matching input is missing.
Autotune internals & imports `modelopt/onnx/quantization/autotune/...`	Added imports (e.g., `get_activation_ops`, insertion-point types/helpers) and wiring to compute resolved insertion points and produce ORT configs from autotune results.
Tests & test models `tests/_test_utils/onnx/quantization/autotune/models.py`, `tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`	Added `_create_simple_resnet18_model()` helper and integration test `test_autotune_quantization_integration` verifying autotune-derived placements are applied by `quantize`.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI
    participant Entry as quantize CLI
    participant Finder as _find_nodes_to_quantize_autotune
    participant Autotune as Autotune Workflow
    participant Quantizer as FP8/INT8 Quantizer
    participant ORT as ORT Configurator
    participant Model as ONNX Model

    CLI->>Entry: invoke quantize(..., autotune=True)
    Entry->>Entry: apply_mode_presets(args)
    Entry->>Finder: _find_nodes_to_quantize_autotune(onnx_model,...)
    Finder->>Autotune: region_pattern_autotuning_workflow(model_or_path, output_dir?)
    Autotune->>Model: analyze regions & patterns
    Autotune-->>Finder: resolved insertion points & configs
    Finder->>Finder: get_resolved_insertion_points(best=True)
    Finder->>Finder: get_ort_quantization_config()
    Finder-->>Entry: nodes_to_quantize, no_quantize_inputs, op_types_needing_output_quant
    Entry->>Quantizer: quantize(nodes_to_quantize,..., autotune=True)
    Quantizer->>ORT: configure_ort(op_types_needing_output_quant)
    ORT-->>Quantizer: configuration applied
    Quantizer->>Model: apply quantization (skip pattern expansion for autotune)
    Quantizer-->>Entry: quantized_model

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly identifies the main change: integrating real Q/DQ scales into the Autotune feature for ONNX quantization, which is the primary objective of this comprehensive PR.
Docstring Coverage	✅ Passed	Docstring coverage is 82.35% which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns	✅ Passed	A comprehensive security review of all modified files in this PR against SECURITY.md guidelines found no security anti-patterns.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

tests/unit/onnx/quantization/autotune/test_region.py (1)

16-21: ⚠️ Potential issue | 🟡 Minor

Remove duplicate license text block.

Lines 16-21 duplicate the license disclaimer already present in lines 10-14. This appears to be a copy-paste error.

🔧 Proposed fix

 # limitations under the License.
-
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 """Tests for the Region class in the autotuner."""

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/autotune/test_region.py` around lines 16 - 21,
Remove the duplicated license disclaimer block that was accidentally
copy-pasted; locate the repeated Apache license/disclaimer text that appears a
second time and delete the redundant block so only the original license header
remains at the top of the file (ensure the first license header is preserved and
no other content is altered).

modelopt/onnx/quantization/fp8.py (1)

219-232: ⚠️ Potential issue | 🟠 Major

Potential AttributeError if nodes_to_exclude is None.

Same issue as in int8.py: line 232 calls nodes_to_exclude.extend() before validation on line 236. If nodes_to_exclude is passed as None, this will fail.

🐛 Proposed fix

     enable_gemv_detection_for_trt = kwargs.get("enable_gemv_detection_for_trt", True)
+    nodes_to_exclude = nodes_to_exclude or []
     if enable_gemv_detection_for_trt and not autotune:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/fp8.py` around lines 219 - 232, The block that
calls nodes_to_exclude.extend(...) when enable_gemv_detection_for_trt and not
autotune can raise AttributeError if nodes_to_exclude is None; before calling
find_nodes_from_matmul_to_exclude and extending, ensure nodes_to_exclude is
initialized to a list (e.g., if nodes_to_exclude is None assign an empty list)
or guard the extend call by creating a new list and assigning it back to
nodes_to_exclude; update the code around enable_gemv_detection_for_trt /
autotune, the find_nodes_from_matmul_to_exclude call, and the nodes_to_exclude
handling so extend is only called on a list.

modelopt/onnx/quantization/int8.py (1)

161-174: ⚠️ Potential issue | 🟠 Major

Potential AttributeError if nodes_to_exclude is None.

When enable_gemv_detection_for_trt is True and autotune is False, line 174 calls nodes_to_exclude.extend() before nodes_to_exclude is validated/converted by find_nodes_to_exclude() on line 178. If nodes_to_exclude is passed as None, this will raise an AttributeError.

🐛 Proposed fix

     enable_gemv_detection_for_trt = kwargs.get("enable_gemv_detection_for_trt", True)
+    nodes_to_exclude = nodes_to_exclude or []
     if enable_gemv_detection_for_trt and not autotune:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/int8.py` around lines 161 - 174, The code may call
nodes_to_exclude.extend(...) when nodes_to_exclude can be None; ensure
nodes_to_exclude is a list before extending: in the block guarded by
enable_gemv_detection_for_trt and not autotune, either initialize
nodes_to_exclude if None (e.g., nodes_to_exclude = nodes_to_exclude or []) or
call find_nodes_to_exclude() earlier and assign/normalize nodes_to_exclude
before using extend; update the logic around nodes_to_exclude,
find_nodes_from_matmul_to_exclude, and find_nodes_to_exclude to guarantee
nodes_to_exclude is always a list when extending.

🧹 Nitpick comments (3)

modelopt/onnx/quantization/quantize.py (1)

272-274: Filename replacement may fail with edge-case paths.

Using onnx_path.replace(".onnx", ".quant_autotune.onnx") could produce unexpected results if ".onnx" appears elsewhere in the path (e.g., /models/onnx.models/model.onnx).

💡 Safer alternative using path manipulation

+    import os
     # Export model with Q/DQ insertion
-    onnx_path_autotune = onnx_path.replace(".onnx", ".quant_autotune.onnx")
+    base, ext = os.path.splitext(onnx_path)
+    onnx_path_autotune = f"{base}.quant_autotune{ext}"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/quantize.py` around lines 272 - 274, The filename
construction using onnx_path.replace(".onnx", ".quant_autotune.onnx") can
mis-replace when ".onnx" appears elsewhere in the path; change the logic that
computes onnx_path_autotune to use proper path/suffix manipulation (e.g.,
Path(onnx_path).with_suffix(".quant_autotune.onnx") or equivalent) before
calling autotuner.export_onnx and appending to intermediate_generated_files,
updating references to onnx_path_autotune, onnx_path, and the
autotuner.export_onnx call accordingly.

modelopt/onnx/utils.py (1)

175-191: Potential IndexError if input/output lists contain unexpected elements.

The list comprehension assumes inp.inputs[0] and out.outputs[0] exist when inp.inputs / out.outputs are truthy. While graphsurgeon typically ensures non-empty lists here, adding explicit length checks would make this more robust.

🛡️ Proposed defensive fix

     return [
         node
         for node in graph.nodes
-        if any(inp.inputs[0].op == "DequantizeLinear" for inp in node.inputs if inp.inputs)
-        or any(out.outputs[0].op == "QuantizeLinear" for out in node.outputs if out.outputs)
+        if any(
+            len(inp.inputs) > 0 and inp.inputs[0].op == "DequantizeLinear"
+            for inp in node.inputs
+            if inp.inputs
+        )
+        or any(
+            len(out.outputs) > 0 and out.outputs[0].op == "QuantizeLinear"
+            for out in node.outputs
+            if out.outputs
+        )
     ]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/utils.py` around lines 175 - 191, The comprehension in
get_quantized_nodes assumes inp.inputs[0] and out.outputs[0] exist and can raise
IndexError; change the two any() guards to explicitly check length (or
truthiness plus index-safe access) before indexing (e.g., ensure len(inp.inputs)
> 0 and len(out.outputs) > 0) so you only evaluate inp.inputs[0].op ==
"DequantizeLinear" and out.outputs[0].op == "QuantizeLinear" when the lists have
at least one element; update the generator to use these safe conditions around
node.inputs and node.outputs to avoid crashes.

modelopt/onnx/quantization/autotune/workflows.py (1)

202-203: Docstring should document temp directory behavior.

The docstring for output_dir doesn't mention that when None is provided, a temporary directory is automatically created via tempfile.mkdtemp(). This is important for API consumers to understand, especially since temp directories may accumulate if keep_output_dir=True (the default).

📝 Suggested docstring update

-        output_dir: Directory for output files (state, logs, models). Created if it doesn't exist.
+        output_dir: Directory for output files (state, logs, models). Created if it doesn't exist.
+                   If None, a temporary directory is created via tempfile.mkdtemp().

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/workflows.py` around lines 202 - 203,
Update the docstring for the output_dir parameter in the function/class that
defines output_dir (the docstring in workflows.py around the autotune workflow)
to explicitly state that when output_dir is None a temporary directory is
created via tempfile.mkdtemp(), and note that the temporary directory will be
retained if keep_output_dir=True (the default), so callers may need to remove it
to avoid accumulation; reference the output_dir parameter name and the
keep_output_dir flag in the description.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 386-390: The log message in the cleanup branch is inverted: inside
the if not keep_output_dir block (where shutil.rmtree(output_dir) is called)
update the logger.debug message to tell users to set keep_output_dir=True to
retain the directory; specifically modify the message emitted by logger.debug
near the removal call that references output_dir and keep_output_dir so it
correctly reads that setting keep_output_dir=True will keep the directory.

In `@modelopt/onnx/quantization/quantize.py`:
- Around line 246-253: The function _find_nodes_to_quantize_autotune uses a
mutable default for intermediate_generated_files (list[str] = []); change the
signature to use None as the default (intermediate_generated_files:
Optional[list[str]] = None) and inside the function, if
intermediate_generated_files is None then set intermediate_generated_files = []
so each call gets a fresh list; update any type hints/imports if needed and
ensure all code in _find_nodes_to_quantize_autotune that appends or inspects
intermediate_generated_files works with the new initialization.

In `@setup.py`:
- Line 62: The dependency entry for "cuda-python" in setup.py lacks a version
constraint and the inline comment "For autotune" is misleading; change the
dependency to include a minimum version compatible with your CUDA/driver/ONNX
Runtime stack (e.g., "cuda-python>=13.0") and update the comment to accurately
state its purpose (e.g., "CUDA Python bindings for GPU/driver interactions -
ensure matches CUDA/ONNX Runtime version"). Ensure this follows the same pinning
style as other dependencies like "onnxslim>=0.1.76" and "polygraphy>=0.49.22".

---

Outside diff comments:
In `@modelopt/onnx/quantization/fp8.py`:
- Around line 219-232: The block that calls nodes_to_exclude.extend(...) when
enable_gemv_detection_for_trt and not autotune can raise AttributeError if
nodes_to_exclude is None; before calling find_nodes_from_matmul_to_exclude and
extending, ensure nodes_to_exclude is initialized to a list (e.g., if
nodes_to_exclude is None assign an empty list) or guard the extend call by
creating a new list and assigning it back to nodes_to_exclude; update the code
around enable_gemv_detection_for_trt / autotune, the
find_nodes_from_matmul_to_exclude call, and the nodes_to_exclude handling so
extend is only called on a list.

In `@modelopt/onnx/quantization/int8.py`:
- Around line 161-174: The code may call nodes_to_exclude.extend(...) when
nodes_to_exclude can be None; ensure nodes_to_exclude is a list before
extending: in the block guarded by enable_gemv_detection_for_trt and not
autotune, either initialize nodes_to_exclude if None (e.g., nodes_to_exclude =
nodes_to_exclude or []) or call find_nodes_to_exclude() earlier and
assign/normalize nodes_to_exclude before using extend; update the logic around
nodes_to_exclude, find_nodes_from_matmul_to_exclude, and find_nodes_to_exclude
to guarantee nodes_to_exclude is always a list when extending.

In `@tests/unit/onnx/quantization/autotune/test_region.py`:
- Around line 16-21: Remove the duplicated license disclaimer block that was
accidentally copy-pasted; locate the repeated Apache license/disclaimer text
that appears a second time and delete the redundant block so only the original
license header remains at the top of the file (ensure the first license header
is preserved and no other content is altered).

---

Nitpick comments:
In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 202-203: Update the docstring for the output_dir parameter in the
function/class that defines output_dir (the docstring in workflows.py around the
autotune workflow) to explicitly state that when output_dir is None a temporary
directory is created via tempfile.mkdtemp(), and note that the temporary
directory will be retained if keep_output_dir=True (the default), so callers may
need to remove it to avoid accumulation; reference the output_dir parameter name
and the keep_output_dir flag in the description.

In `@modelopt/onnx/quantization/quantize.py`:
- Around line 272-274: The filename construction using
onnx_path.replace(".onnx", ".quant_autotune.onnx") can mis-replace when ".onnx"
appears elsewhere in the path; change the logic that computes onnx_path_autotune
to use proper path/suffix manipulation (e.g.,
Path(onnx_path).with_suffix(".quant_autotune.onnx") or equivalent) before
calling autotuner.export_onnx and appending to intermediate_generated_files,
updating references to onnx_path_autotune, onnx_path, and the
autotuner.export_onnx call accordingly.

In `@modelopt/onnx/utils.py`:
- Around line 175-191: The comprehension in get_quantized_nodes assumes
inp.inputs[0] and out.outputs[0] exist and can raise IndexError; change the two
any() guards to explicitly check length (or truthiness plus index-safe access)
before indexing (e.g., ensure len(inp.inputs) > 0 and len(out.outputs) > 0) so
you only evaluate inp.inputs[0].op == "DequantizeLinear" and out.outputs[0].op
== "QuantizeLinear" when the lists have at least one element; update the
generator to use these safe conditions around node.inputs and node.outputs to
avoid crashes.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f668a3 and 52b9c31.

📒 Files selected for processing (11)

modelopt/onnx/op_types.py
modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/autotune/workflows.py
modelopt/onnx/quantization/fp8.py
modelopt/onnx/quantization/graph_utils.py
modelopt/onnx/quantization/int8.py
modelopt/onnx/quantization/ort_utils.py
modelopt/onnx/quantization/quantize.py
modelopt/onnx/utils.py
setup.py
tests/unit/onnx/quantization/autotune/test_region.py

modelopt/onnx/quantization/autotune/workflows.py

modelopt/onnx/quantization/quantize.py

setup.py

modelopt/onnx/quantization/__main__.py

modelopt/onnx/quantization/quantize.py

modelopt-bot · 2026-03-02T18:52:14Z

modelopt/onnx/quantization/quantize.py

@@ -242,6 +243,81 @@ def _preprocess_onnx(
    )




Overall Assessment

This PR is well-structured and achieves its stated goals of:

Integrating Auto Q/DQ into the ONNX quantization workflow

Enabling calibration data to obtain correct scales for Q/DQ nodes

The changes are substantial but well-organized across multiple files. Below are my detailed review comments.

Copilot

Pull request overview

Integrates ONNX Auto Q/DQ (TensorRT-driven autotuning) into the existing ONNX quantization workflow so Q/DQ placement can be derived from TensorRT profiling and then calibrated to produce real (non-random) Q/DQ scales.

Changes:

Added an --autotune flag (and autotune plumbing) to route INT8/FP8 quantization through the Auto Q/DQ placement workflow.
Introduced utilities to detect “quantized nodes” from a Q/DQ-inserted model and used this to drive node selection + ORT configuration tweaks (output quantization for certain producers).
Updated autotune workflow API to accept in-memory models and optionally auto-manage its output directory.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
`tests/unit/onnx/quantization/autotune/test_region.py`	Updates file header metadata.
`setup.py`	Adds `cuda-python` to ONNX optional dependencies to support TensorRT Python autotune benchmarking.
`modelopt/onnx/utils.py`	Adds `get_quantized_nodes()` helper for extracting quantized nodes from a Q/DQ graph.
`modelopt/onnx/quantization/quantize.py`	Adds `autotune` flag, integrates Auto Q/DQ placement, and feeds results into INT8/FP8 quantizers.
`modelopt/onnx/quantization/ort_utils.py`	Extends ORT configuration to optionally allow output quantization for selected op types.
`modelopt/onnx/quantization/int8.py`	Adds `autotune` plumbing and bypasses some default heuristics when autotune is enabled.
`modelopt/onnx/quantization/graph_utils.py`	Fixes partial-input Q/DQ removal to patch the intended consumer branch (shared Q/DQ case).
`modelopt/onnx/quantization/fp8.py`	Adds `autotune` plumbing and bypasses some default heuristics when autotune is enabled.
`modelopt/onnx/quantization/autotune/workflows.py`	Allows ModelProto input, optional output_dir, and adds optional output-dir cleanup.
`modelopt/onnx/quantization/__main__.py`	Adds CLI flag `--autotune`.
`modelopt/onnx/op_types.py`	Adds `get_activation_ops()` used by autotune integration logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

modelopt/onnx/quantization/quantize.py

modelopt/onnx/quantization/autotune/workflows.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/utils.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/quantization/graph_utils.py

modelopt/onnx/quantization/autotune/workflows.py

modelopt/onnx/utils.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/quantization/fp8.py

modelopt/onnx/quantization/autotune/workflows.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/quantization/graph_utils.py

setup.py

tests/unit/onnx/quantization/autotune/test_region.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/op_types.py

modelopt-bot

Review completed. I've posted several inline comments on specific lines. Overall this is a well-structured PR that successfully integrates Auto Q/DQ into the ONNX quantization workflow. Key highlights include good integration via _find_nodes_to_quantize_autotune, flexible API changes for in-memory models, and an important bug fix for shared Q/DQ pair handling. Please address the inline comments regarding documentation completion, code organization suggestions, and the copyright year consistency. Recommend approving with minor changes.

codecov · 2026-03-02T20:55:59Z

Codecov Report

❌ Patch coverage is 43.84615% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.59%. Comparing base (d3748c2) to head (1a531b9).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...elopt/onnx/quantization/autotune/autotuner_base.py	32.72%	37 Missing ⚠️
modelopt/onnx/quantization/quantize.py	31.25%	11 Missing ⚠️
modelopt/onnx/quantization/autotune/workflows.py	16.66%	10 Missing ⚠️
modelopt/onnx/quantization/autotune/__main__.py	18.18%	9 Missing ⚠️
modelopt/onnx/quantization/graph_utils.py	66.66%	3 Missing ⚠️
modelopt/onnx/quantization/int8.py	84.61%	2 Missing ⚠️
modelopt/onnx/op_types.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #951      +/-   ##
==========================================
- Coverage   71.71%   71.59%   -0.13%     
==========================================
  Files         211      211              
  Lines       23984    24034      +50     
==========================================
+ Hits        17200    17207       +7     
- Misses       6784     6827      +43

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/_test_utils/onnx/quantization/autotune/models.py

tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/quantization/autotune/autotuner_base.py

tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

modelopt/onnx/quantization/autotune/workflows.py

Copilot · 2026-03-03T20:52:01Z

modelopt/onnx/quantization/autotune/workflows.py

                   uses <output_dir>/autotuner_state.yaml (default: None)
        quant_type: Quantization data type - "int8" for INT8 quantization (default),
                   "fp8" for FP8 quantization
+        default_dq_dtype: Dtype for DequantizeLinear output; "float32" (default) or "float16".


The default_dq_dtype docstring says only "float32" or "float16", but the quantization workflow now maps high_precision_dtype="bf16" to default_dq_dtype="bfloat16". Please update the docstring to include bfloat16 (or more generally state that any dtype supported by export_utils.resolve_dtype() is accepted).

Suggested change

default_dq_dtype: Dtype for DequantizeLinear output; "float32" (default) or "float16".

default_dq_dtype: Dtype for DequantizeLinear output; e.g. "float32" (default), "float16",

"bfloat16", or any dtype supported by export_utils.resolve_dtype().

@willg-nv is bfloat16 also supported in default_dq_dtype?

BF16 is marked deprecated by TRT. I would suggest we don't support it as default dtype.

tests/_test_utils/onnx/quantization/autotune/models.py

modelopt/onnx/quantization/autotune/__init__.py

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

… silent corruption of the graph. Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

modelopt/onnx/quantization/autotune/autotuner_base.py

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

#998) ### What does this PR do? **Type of change**: Bug fix **Overview**: Replace CUDA memory management from CUDART to PyTorch (higher-level API). ### Usage ```python # Add a code snippet demonstrating how to use this ``` ### Testing 1. Added unittests. 2. Tested that this PR does not break #951 or #978 ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.). - Is this change backward compatible?: ✅ - If you copied code from any other source, did you follow IP policy in [CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?: N/A  - Did you write any new necessary tests?: ✅  - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A  ### Additional Information Summary of changes in `benchmark.py — TensorRTPyBenchmark`: | What changed | Before | After | |---|---|---| | Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) | | Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` | | `__init__` guard | checks `CUDART_AVAILABLE or cudart is None` | checks `TORCH_CUDA_AVAILABLE` | | `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` | | `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()` — PyTorch GC handles deallocation | | `_allocate_buffers` | raw `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address | | `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` | | `_run_timing` | same cudart pattern | same torch pattern | | `run` — stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` | | `run` — stream arg to TRT | raw integer handle | `stream.cuda_stream` (integer property) | | Error handling | `cudaError_t` return codes | PyTorch raises `RuntimeError`, caught by existing `except Exception` | Related to #961  ## Summary by CodeRabbit * **Refactor** * TensorRT benchmarking migrated from direct CUDA runtime calls to PyTorch CUDA tensors, pinned memory, and CUDA stream primitives — simplifying buffer management, transfers, and timing semantics. * **Tests** * Expanded GPU autotune benchmark tests with broader unit and integration coverage for CUDA/TensorRT paths, pinned-host/device buffering, stream behavior, warmup/timing, and end-to-end latency scenarios.  --------- Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

modelopt/onnx/quantization/int8.py (1)
217-231: ⚠️ Potential issue | 🟠 Major

Keep calibration-cache enforcement for autotune-selected producers.

iq_quantized_nodes only marks nodes whose inputs appear in the cache. That misses the producer nodes autotune adds for output quantization, so the new autotune guard avoids dropping them — but it also leaves truly uncached nodes in nodes_to_quantize. When the caller only supplies calibration_cache_path, those nodes fall back to the random CalibrationDataReader created in modelopt/onnx/quantization/quantize.py lines 541-547. Autotune needs a separate cache filter that also recognizes cached producer outputs instead of bypassing the filter entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/int8.py` around lines 217 - 231, The current cache
filter only marks nodes whose inputs appear in act_scales_dict
(iq_quantized_nodes), which misses producer nodes added by autotune that have
their *outputs* cached; update the logic that builds iq_quantized_nodes in the
block guarded by calibration_cache_path so it also considers node outputs (after
normalizing act_scales_dict keys the same way you do into quantized_tensors) in
addition to inputs, then compute the intersection with nodes_to_quantize exactly
as before (when not autotune) so any node that either consumes or produces a
cached tensor remains eligible for quantization; adjust references to
graph.nodes, iq_quantized_nodes, quantized_tensors, nodes_to_quantize,
calibration_cache_path, act_scales_dict and autotune accordingly.

🧹 Nitpick comments (1)

tests/_test_utils/onnx/quantization/autotune/models.py (1)
59-95: Reduce the default test tensor size.

This helper builds a multi-billion-op model by default (1024x1024 input), which makes the downstream GPU autotune integration path much slower and flakier than necessary for a placement test. Please parameterize the spatial size or default it to something much smaller.
♻️ Suggested adjustment
-def _create_simple_resnet18_model():
+def _create_simple_resnet18_model(input_hw: int = 224):
     """Build a ResNet-18 subgraph (stem + layer1) for MOQ + Autotuner integration tests.
@@
-    Input shape: [1, 3, 1024, 1024], output shape: [1, 64, 256, 256].
+    Input shape: [1, 3, input_hw, input_hw].
     """
@@
-    input_tensor = torch.zeros(1, 3, 1024, 1024)
+    input_tensor = torch.zeros(1, 3, input_hw, input_hw)
Based on learnings GPU-based unit tests in tests/gpu should be fast and in most cases not take more than a few seconds to run.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_test_utils/onnx/quantization/autotune/models.py` around lines 59 - 95,
The test helper _create_simple_resnet18_model() uses a huge default input
(1,3,1024,1024) which makes GPU autotune tests slow; change the function to
accept a spatial_size parameter with a much smaller default (e.g., 64 or 128),
use that parameter when constructing input_tensor and any shape-related
comments, and keep the model classes (_BasicBlock and _Model) unchanged; also
update any tests that call _create_simple_resnet18_model() to pass an explicit
spatial_size when they need the large input.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/__main__.py`:
- Around line 458-463: The call to
get_node_filter_list(args.autotune_node_filter_list) runs even when autotune is
disabled, causing failures for plain PTQ runs; change the logic so
get_node_filter_list is invoked only when autotune_enabled is true (i.e., move
or wrap the get_node_filter_list call inside the same autotune_enabled branch
that calls apply_mode_presets), referencing the existing autotune_enabled
variable and args.autotune_node_filter_list so the filter list is only parsed
when apply_mode_presets/autotune are active.

In `@modelopt/onnx/quantization/autotune/autotuner_base.py`:
- Around line 522-530: The current code resolves producers via node.name which
is optional and non-unique; instead build a stable producer mapping and lookup
by output tensor identity: create a mapping from each node's output tensor names
to its index (e.g., iterate graph.nodes and for each node.outputs add mapping
output_name -> node_index), then in the loop use tensor.inputs[0].name to find
the producer index via that output_name mapping; as a fallback, if
tensor.inputs[0] has no name or isn’t found, fall back to scanning graph.nodes
to match the producing output object/identity to find its index (use symbols
node_name_to_idx -> replace with output_name_to_idx, graph.nodes, tensor.inputs,
covered, quantized_node_indices, nodes_to_quantize).

In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 220-223: The code creates a persistent temp folder when output_dir
is not provided (using tempfile.mkdtemp() and Path(...).mkdir()) but never
returns or cleans it up; change this to use tempfile.TemporaryDirectory() (or
create and track the TemporaryDirectory object) so the temp dir is removed when
done, or explicitly return the created path and document ownership so callers
can clean it up; update the logic around output_dir, tempfile.mkdtemp(), and the
output_dir.mkdir(...) call (or add a cleanup hook) so temp dirs are not
orphaned.

In `@modelopt/onnx/quantization/int8.py`:
- Around line 199-203: The code currently only calls
expand_node_names_from_patterns(graph, nodes_to_quantize) inside the if not
autotune branch, so regex patterns in nodes_to_quantize are not expanded when
autotune is True; change the logic so nodes_to_quantize =
expand_node_names_from_patterns(graph, nodes_to_quantize) is executed
unconditionally (before or independent of the autotune branch) so that
nodes_to_quantize patterns are expanded into concrete node names even when
autotune is enabled, keeping subsequent behavior the same.

In `@modelopt/onnx/quantization/quantize.py`:
- Around line 576-580: The code in quantize(...) is overwriting the
caller-provided op_types_to_quantize with the full autotuner list returned by
get_ort_quantization_config(); change this so that when autotune is enabled you
intersect the autotuner's op-type set with the caller's op_types_to_quantize (or
raise an error if both cannot be combined) instead of assigning it
wholesale—locate the block where get_ort_quantization_config() is called and
nodes_to_quantize_autotune, op_types_to_quantize, no_quantize_inputs are
unpacked and update it to compute the intersection (or validate mutual
exclusivity) between the returned op types and the existing op_types_to_quantize
argument before proceeding.
- Around line 270-277: init_benchmark_instance() can return None on failure but
the caller currently ignores that; modify the call so you capture its return
(e.g., benchmark = init_benchmark_instance(...)) and immediately fail fast if
benchmark is falsy by raising an exception or returning an error before
continuing into the autotune workflow; ensure the error includes context (that
TensorRT benchmark initialization failed) and any relevant inputs
(use_trtexec/trtexec_args) so upstream logic does not proceed when
init_benchmark_instance() fails.

In `@tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`:
- Around line 76-81: The GPU test is too heavy: reduce the benchmark warmup and
run counts and trim the autotuner's scheme search; call init_benchmark_instance
with much smaller workload (e.g., warmup=1 and runs=3) instead of defaults and
pass a reduced search limit to region_pattern_autotuning_workflow (e.g.,
schemes_per_region=5 or max_schemes_per_region=5) when constructing autotuner
for onnx_model (keep quant_type="int8" and default_dq_dtype="float16"); update
the init_benchmark_instance(...) and region_pattern_autotuning_workflow(...)
calls accordingly so the test completes in seconds.
- Around line 57-58: The test incorrectly calls pathlib.Path.replace() (a
filesystem rename) on onnx_path; change the output_path computation so it
performs string-like suffix replacement instead, e.g. use
onnx_path.with_suffix(".quant.onnx") or build the filename via onnx_path.parent
/ (onnx_path.stem + ".quant.onnx") when setting output_path (do not call
Path.replace).

---

Outside diff comments:
In `@modelopt/onnx/quantization/int8.py`:
- Around line 217-231: The current cache filter only marks nodes whose inputs
appear in act_scales_dict (iq_quantized_nodes), which misses producer nodes
added by autotune that have their *outputs* cached; update the logic that builds
iq_quantized_nodes in the block guarded by calibration_cache_path so it also
considers node outputs (after normalizing act_scales_dict keys the same way you
do into quantized_tensors) in addition to inputs, then compute the intersection
with nodes_to_quantize exactly as before (when not autotune) so any node that
either consumes or produces a cached tensor remains eligible for quantization;
adjust references to graph.nodes, iq_quantized_nodes, quantized_tensors,
nodes_to_quantize, calibration_cache_path, act_scales_dict and autotune
accordingly.

---

Nitpick comments:
In `@tests/_test_utils/onnx/quantization/autotune/models.py`:
- Around line 59-95: The test helper _create_simple_resnet18_model() uses a huge
default input (1,3,1024,1024) which makes GPU autotune tests slow; change the
function to accept a spatial_size parameter with a much smaller default (e.g.,
64 or 128), use that parameter when constructing input_tensor and any
shape-related comments, and keep the model classes (_BasicBlock and _Model)
unchanged; also update any tests that call _create_simple_resnet18_model() to
pass an explicit spatial_size when they need the large input.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02a656a5-64f7-4cb8-a72d-605282bacb60

📥 Commits

Reviewing files that changed from the base of the PR and between 52b9c31 and 0a32bea.

📒 Files selected for processing (13)

modelopt/onnx/op_types.py
modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/autotune/__init__.py
modelopt/onnx/quantization/autotune/__main__.py
modelopt/onnx/quantization/autotune/autotuner_base.py
modelopt/onnx/quantization/autotune/workflows.py
modelopt/onnx/quantization/fp8.py
modelopt/onnx/quantization/graph_utils.py
modelopt/onnx/quantization/int8.py
modelopt/onnx/quantization/ort_utils.py
modelopt/onnx/quantization/quantize.py
tests/_test_utils/onnx/quantization/autotune/models.py
tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/onnx/quantization/graph_utils.py

modelopt/onnx/quantization/__main__.py

modelopt/onnx/quantization/autotune/autotuner_base.py

modelopt/onnx/quantization/autotune/workflows.py

modelopt/onnx/quantization/int8.py

modelopt/onnx/quantization/quantize.py

tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

coderabbitai

♻️ Duplicate comments (1)

modelopt/onnx/quantization/__main__.py (1)

458-462: ⚠️ Potential issue | 🟡 Minor

Only load the filter list when autotune is enabled.

get_node_filter_list is called unconditionally at line 462, but it should only be invoked when autotune_enabled is True. Passing --autotune_node_filter_list without --autotune will currently attempt to validate and load the file unnecessarily.

🐛 Suggested fix

     # Autotune configs
     autotune_enabled = args.autotune is not None
     if autotune_enabled:
         apply_mode_presets(args)
-    autotune_node_filter_list = get_node_filter_list(args.autotune_node_filter_list)
+        autotune_node_filter_list = get_node_filter_list(args.autotune_node_filter_list)
+    else:
+        autotune_node_filter_list = None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/__main__.py` around lines 458 - 462, The call to
get_node_filter_list(args.autotune_node_filter_list) runs regardless of
autotune_enabled; change the logic so get_node_filter_list is only invoked when
autotune_enabled is True (i.e., move or wrap the call inside the same if block
that calls apply_mode_presets), using the existing variables autotune_enabled,
apply_mode_presets, and args.autotune_node_filter_list to ensure the filter file
is only validated/loaded when --autotune is enabled.

🧹 Nitpick comments (1)

modelopt/onnx/quantization/autotune/__main__.py (1)

121-139: Type hint and encoding improvements.

Two minor issues:

The parameter node_filter_list_path is typed as str but accepts None (checked at line 131). Should be str | None.
The open() call should specify encoding="utf-8" for cross-platform consistency.

♻️ Suggested fix

-def get_node_filter_list(node_filter_list_path: str) -> list | None:
+def get_node_filter_list(node_filter_list_path: str | None) -> list | None:
     """Extract node filter list from node filters path.

     Args:
         node_filter_list_path: Path to a file containing wildcard patterns to filter ONNX nodes (one pattern per line).

     Returns:
         Node filter list
     """
     node_filter_list = None
     if node_filter_list_path:
         filter_file = validate_file_path(node_filter_list_path, "Node filter list file")
         if filter_file:
-            with open(filter_file) as f:
+            with open(filter_file, encoding="utf-8") as f:
                 node_filter_list = [
                     line.strip() for line in f if line.strip() and not line.strip().startswith("#")
                 ]
             logger.info(f"Loaded {len(node_filter_list)} filter patterns from {filter_file}")
     return node_filter_list

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/__main__.py` around lines 121 - 139,
Update the get_node_filter_list signature to accept an optional path (change
parameter type to str | None) and ensure file reads are UTF-8 encoded; inside
get_node_filter_list (which calls validate_file_path) adjust the annotation for
node_filter_list_path to allow None and open the returned filter_file with
encoding="utf-8" when building node_filter_list, preserving the existing
comment/skip logic and logging behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@modelopt/onnx/quantization/__main__.py`:
- Around line 458-462: The call to
get_node_filter_list(args.autotune_node_filter_list) runs regardless of
autotune_enabled; change the logic so get_node_filter_list is only invoked when
autotune_enabled is True (i.e., move or wrap the call inside the same if block
that calls apply_mode_presets), using the existing variables autotune_enabled,
apply_mode_presets, and args.autotune_node_filter_list to ensure the filter file
is only validated/loaded when --autotune is enabled.

---

Nitpick comments:
In `@modelopt/onnx/quantization/autotune/__main__.py`:
- Around line 121-139: Update the get_node_filter_list signature to accept an
optional path (change parameter type to str | None) and ensure file reads are
UTF-8 encoded; inside get_node_filter_list (which calls validate_file_path)
adjust the annotation for node_filter_list_path to allow None and open the
returned filter_file with encoding="utf-8" when building node_filter_list,
preserving the existing comment/skip logic and logging behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3a0d6d32-0e48-48d1-a1c2-b544bc1f3e60

📥 Commits

Reviewing files that changed from the base of the PR and between 0a32bea and 7730b51.

📒 Files selected for processing (3)

modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/autotune/__init__.py
modelopt/onnx/quantization/autotune/__main__.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/onnx/quantization/autotune/init.py

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/autotune/autotuner_base.py`:
- Around line 503-528: ResolvedInsertionPoint entries that are pure tensor-level
locations with no consumers are dropped because the loop only expands to
consumer (node,input) pairs; to fix, when iterating resolved_ips detect
tensor-level insertions whose graph.tensor_users_map.get(ip.tensor_name, []) is
empty and then locate the producer node that outputs that tensor (scan
graph.nodes for a node with an output whose name equals ip.tensor_name or use an
existing producer lookup) and add the corresponding (producer_idx, output_index)
into covered and producer_idx into quantized_node_indices so the producer is
preserved in get_ort_quantization_config() (refer to resolved_ips, covered,
graph.tensor_users_map, node_name_to_idx, quantized_node_indices,
get_ort_quantization_config(), export_onnx()).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a001dc52-7976-4fb8-95ca-653dcb5f056c

📥 Commits

Reviewing files that changed from the base of the PR and between 7730b51 and eb0e064.

📒 Files selected for processing (1)

modelopt/onnx/quantization/autotune/autotuner_base.py

modelopt/onnx/quantization/autotune/autotuner_base.py

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

modelopt/onnx/quantization/__main__.py

modelopt/onnx/quantization/quantize.py

modelopt/onnx/quantization/autotune/workflows.py

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

modelopt/onnx/quantization/quantize.py (1)

606-611: ⚠️ Potential issue | 🟠 Major

Preserve the caller’s op_types_to_quantize filter in autotune mode.

_find_nodes_to_quantize_autotune() returns the autotuner’s full quantizable-op set, and rebinding op_types_to_quantize here discards any narrower filter the caller passed in. quantize(..., autotune=True, op_types_to_quantize=["Conv"]) can still quantize other op types once Line 641 forwards the overwritten list. Intersect the autotuner result with the caller filter, or reject the combination explicitly.

Also applies to: 641-641

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/quantize.py`:
- Around line 49-53: The module-level imports of MODE_PRESETS,
init_benchmark_instance, and region_pattern_autotuning_workflow are loading
benchmark dependencies even when autotune=False; move MODE_PRESETS into a
lightweight defaults submodule (e.g., quantization_defaults) that has no
benchmark imports, and remove the top-level imports of init_benchmark_instance
and region_pattern_autotuning_workflow; then perform lazy imports of
init_benchmark_instance and region_pattern_autotuning_workflow inside the
autotune branch of the quantize() function so those benchmark modules are only
loaded when autotune is requested.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4bbdd21b-62e4-43e4-a5a6-baef6b4dc459

📥 Commits

Reviewing files that changed from the base of the PR and between eb0e064 and 7cc54a5.

📒 Files selected for processing (5)

modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/autotune/__main__.py
modelopt/onnx/quantization/autotune/workflows.py
modelopt/onnx/quantization/quantize.py
tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

modelopt/onnx/quantization/quantize.py

tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

modelopt-bot · 2026-03-09T21:43:22Z

Summary of Remaining Actionable Items

Thanks for all the work on this PR! I've reviewed the conversation history and identified the following items that still need attention:

🔴 Critical Issues (Should Address)

int8.py & fp8.py - nodes_to_exclude None handling (Lines ~217-232 in fp8.py, ~161-174 in int8.py)
- Issue: When enable_gemv_detection_for_trt is True and autotune is False, the code calls nodes_to_exclude.extend() before validating that nodes_to_exclude is not None
- Fix: Add nodes_to_exclude = nodes_to_exclude or [] before the extend call
- Status: This was flagged in the Mar 2 review but needs verification in current code
quantize.py - op_types_to_quantize overwriting in autotune mode (Lines ~606-611)
- Issue: When autotune is enabled, the code overwrites the caller-provided op_types_to_quantize with the full autotuner list, discarding any narrower filter the caller passed in
- Fix: Intersect the autotuner's op-type set with the caller's op_types_to_quantize filter, or raise an error if both cannot be combined
- Status: Flagged in latest Mar 9 review by CodeRabbit

🟡 Minor Issues (Nice to Have)

test_region.py - Duplicate license block (Lines 16-21)
- Issue: Duplicate Apache license text block (copy-paste error)
- Fix: Remove the redundant license disclaimer block (lines 16-21)
- Status: Flagged Mar 2, should be a quick fix
quantize.py - Lazy imports for benchmark modules
- Issue: Module-level imports of benchmark dependencies load even when autotune=False
- Fix: Move init_benchmark_instance and region_pattern_autotuning_workflow imports inside the autotune branch for lazy loading
- Status: Suggestion from Mar 9 CodeRabbit review

Next Steps

Could you please address items 1 and 2 (the critical issues) before we proceed? Items 3 and 4 are lower priority but would be nice to have.

Let me know if you need any clarification on these points!

Generated by review bot - please address these items before merge

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

cjluo-nv · 2026-03-09T22:46:37Z

Responding to modelopt-bot's review:

Agree:

The Swish concern in get_activation_ops() is valid — Swish is not a standard ONNX operator (unlike HardSwish which is ONNX op 14+). It may silently never match anything. Worth verifying this is intentional or removing it.
Docstring nit for workflows.py is valid.
The cuda-python dependency question is worth answering in the PR description.

Disagree / nuance:

Skipping MHA/conv exclusions in autotune mode is intentional and correct. The static heuristics (find_nodes_from_mha_to_exclude, find_nodes_from_convs_to_exclude) exist as conservative fallbacks when you don't have profiling data — they're irrelevant when TRT's own runtime is deciding Q/DQ placement. A brief comment in the code would be sufficient to address the bot's confusion here; the logic is sound.

One thing the bot missed:

_find_nodes_to_quantize_autotune() in quantize.py has no docstring, inconsistent with the surrounding code style.

Overall I agree with the bot's verdict: the core integration is sound, get_ort_quantization_config() is a clean abstraction, and the temp-dir cleanup is handled correctly. Minor fixes needed.

…ize overwrite and other flags (should have the same behavior as pre-autotune) Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

gcunhase · 2026-03-09T22:52:02Z

Summary of Remaining Actionable Items

Thanks for all the work on this PR! I've reviewed the conversation history and identified the following items that still need attention:

🔴 Critical Issues (Should Address)

int8.py & fp8.py - nodes_to_exclude None handling (Lines ~217-232 in fp8.py, ~161-174 in int8.py)

Issue: When enable_gemv_detection_for_trt is True and autotune is False, the code calls nodes_to_exclude.extend() before validating that nodes_to_exclude is not None

Fix: Add nodes_to_exclude = nodes_to_exclude or [] before the extend call

Status: This was flagged in the Mar 2 review but needs verification in current code

quantize.py - op_types_to_quantize overwriting in autotune mode (Lines ~606-611)

Issue: When autotune is enabled, the code overwrites the caller-provided op_types_to_quantize with the full autotuner list, discarding any narrower filter the caller passed in

Fix: Intersect the autotuner's op-type set with the caller's op_types_to_quantize filter, or raise an error if both cannot be combined

Status: Flagged in latest Mar 9 review by CodeRabbit

🟡 Minor Issues (Nice to Have)

test_region.py - Duplicate license block (Lines 16-21)

Issue: Duplicate Apache license text block (copy-paste error)

Fix: Remove the redundant license disclaimer block (lines 16-21)

Status: Flagged Mar 2, should be a quick fix

quantize.py - Lazy imports for benchmark modules

Issue: Module-level imports of benchmark dependencies load even when autotune=False

Fix: Move init_benchmark_instance and region_pattern_autotuning_workflow imports inside the autotune branch for lazy loading

Status: Suggestion from Mar 9 CodeRabbit review

Next Steps

Could you please address items 1 and 2 (the critical issues) before we proceed? Items 3 and 4 are lower priority but would be nice to have.

Let me know if you need any clarification on these points!

Generated by review bot - please address these items before merge

This is done in quantize.py#LN582 before calling quantize (which calls int8.py or fp8.py):

nodes_to_exclude = nodes_to_exclude or []

Fixed by adding the following after _find_nodes_to_quantize_autotune():

op_types_to_quantize = op_types_to_quantize or op_types_to_quantize_autotune

Done
Done. Module-level import is needing for integration workflow testing (patching of the Autotune model without needing a TRT re-run). The solution to avoid benchmark dependencies load even when autotune=False is to wrap those imports with a try-except check.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

gcunhase · 2026-03-09T22:59:14Z

Responding to modelopt-bot's review:

Agree:

The Swish concern in get_activation_ops() is valid — Swish is not a standard ONNX operator (unlike HardSwish which is ONNX op 14+). It may silently never match anything. Worth verifying this is intentional or removing it.

Docstring nit for workflows.py is valid.

The cuda-python dependency question is worth answering in the PR description.

Disagree / nuance:

Skipping MHA/conv exclusions in autotune mode is intentional and correct. The static heuristics (find_nodes_from_mha_to_exclude, find_nodes_from_convs_to_exclude) exist as conservative fallbacks when you don't have profiling data — they're irrelevant when TRT's own runtime is deciding Q/DQ placement. A brief comment in the code would be sufficient to address the bot's confusion here; the logic is sound.

One thing the bot missed:

_find_nodes_to_quantize_autotune() in quantize.py has no docstring, inconsistent with the surrounding code style.

Overall I agree with the bot's verdict: the core integration is sound, get_ort_quantization_config() is a clean abstraction, and the temp-dir cleanup is handled correctly. Minor fixes needed.

Replies:

Swish is a native ONNX op form opset 24: https://onnx.ai/onnx/operators/onnx__Swish.html
Not sure what this is for, likely outdated.
cuda-python concerns is outdated: [ONNX][Autotune] Replace CUDA memory management from CUDART to PyTorch #998
Added comment on MHA exclusiong function.
Added docstring in _find_nodes_to_quantize_autotune().

willg-nv · 2026-03-10T02:29:37Z

modelopt/onnx/quantization/autotune/__init__.py

 """

+# Expose Autotune modes and args
+from .__main__ import MODE_PRESETS, StoreWithExplicitFlag, get_node_filter_list


I think it is better to move get_node_filter_list() to common.py or utils.py?

willg-nv · 2026-03-10T03:33:28Z

modelopt/onnx/quantization/quantize.py

 from modelopt.onnx.op_types import is_data_dependent_shape_op
+
+try:
+    from modelopt.onnx.quantization.autotune.workflows import (


Why would internal importing cause ImportError?

gcunhase requested a review from a team as a code owner March 2, 2026 18:15

gcunhase requested a review from ajrasane March 2, 2026 18:15

gcunhase requested a review from cjluo-nv March 2, 2026 18:16

coderabbitai bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/autotune/workflows.py Outdated Show resolved Hide resolved

modelopt/onnx/quantization/quantize.py Show resolved Hide resolved

setup.py Outdated Show resolved Hide resolved

cjluo-nv reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/__main__.py Show resolved Hide resolved

cjluo-nv reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/quantize.py Outdated Show resolved Hide resolved

cjluo-nv requested a review from Copilot March 2, 2026 18:45

Copilot started reviewing on behalf of cjluo-nv March 2, 2026 18:46 View session

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

Copilot AI reviewed Mar 2, 2026

View reviewed changes

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/fp8.py Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/autotune/workflows.py Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/quantize.py Outdated Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/graph_utils.py Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

setup.py Outdated Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

tests/unit/onnx/quantization/autotune/test_region.py Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/quantization/quantize.py Outdated Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

modelopt/onnx/op_types.py Show resolved Hide resolved

modelopt-bot reviewed Mar 2, 2026

View reviewed changes

gcunhase changed the title ~~[OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune~~ Draft: [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune Mar 2, 2026

gcunhase force-pushed the dev/gcunhasergio/autotune_real_qdq_scales branch from ab4c5a3 to c23208f Compare March 2, 2026 20:40

cjluo-nv requested a review from Copilot March 3, 2026 20:44

Copilot started reviewing on behalf of cjluo-nv March 3, 2026 20:44 View session

Copilot AI reviewed Mar 3, 2026

View reviewed changes

gcunhase changed the title ~~Draft: [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune~~ [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune Mar 5, 2026

gcunhase mentioned this pull request Mar 6, 2026

[ONNX][Autotune] Replace CUDA memory management from CUDART to PyTorch #998

Merged

ajrasane reviewed Mar 9, 2026

View reviewed changes

modelopt/onnx/quantization/autotune/__init__.py Outdated Show resolved Hide resolved

Initial autotune codebase

3554ecf

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

gcunhase added 5 commits March 9, 2026 12:58

Added mode presets and additional autotune configurations

1dc03cd

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Fixed tmp_path in test

ddacbcb

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Fixed copilot comments

689a907

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Fix: skip rewiring in graph_utils if no index is found. This prevents…

b64322f

… silent corruption of the graph. Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Match args for preset mode default

0a32bea

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

gcunhase force-pushed the dev/gcunhasergio/autotune_real_qdq_scales branch from c147979 to 0a32bea Compare March 9, 2026 16:58

ajrasane reviewed Mar 9, 2026

View reviewed changes

modelopt/onnx/quantization/autotune/autotuner_base.py Outdated Show resolved Hide resolved

Exposed _StoreWithExplicitFlag

7730b51

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

coderabbitai bot reviewed Mar 9, 2026

View reviewed changes

ajrasane approved these changes Mar 9, 2026

View reviewed changes

Renamed new_ips to new_insertion_points

eb0e064

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

coderabbitai bot reviewed Mar 9, 2026

View reviewed changes

modelopt/onnx/quantization/autotune/autotuner_base.py Show resolved Hide resolved

cjluo-nv requested a review from Copilot March 9, 2026 18:36

Copilot started reviewing on behalf of cjluo-nv March 9, 2026 18:36 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

Address coderabbit and copilot issues + other minor issues

7cc54a5

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

coderabbitai bot reviewed Mar 9, 2026

View reviewed changes

modelopt/onnx/quantization/quantize.py Outdated Show resolved Hide resolved

tests/gpu/onnx/quantization/test_autotune_quantization_integration.py Show resolved Hide resolved

Address additional coderabbit and copilot issues

8634b74

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Added real scales test in the integration workflow

0d82f64

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

Address additional copilot issues: includes fix for op_types_to_quant…

ee87330

…ize overwrite and other flags (should have the same behavior as pre-autotune) Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

nit: added docstring and comment

1a531b9

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

willg-nv reviewed Mar 10, 2026

View reviewed changes

cjluo-nv approved these changes Mar 10, 2026

View reviewed changes

	default_dq_dtype: Dtype for DequantizeLinear output; "float32" (default) or "float16".
	default_dq_dtype: Dtype for DequantizeLinear output; e.g. "float32" (default), "float16",
	"bfloat16", or any dtype supported by export_utils.resolve_dtype().

Conversation

gcunhase commented Mar 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Summary by CodeRabbit

Additional information

Uh oh!

coderabbitai bot commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

modelopt-bot Mar 2, 2026

Choose a reason for hiding this comment

Overall Assessment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

modelopt-bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

gcunhase Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

wentingj-nv Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gcunhase commented Mar 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 2, 2026 •

edited

Loading

codecov bot commented Mar 2, 2026 •

edited

Loading

willg-nv Mar 10, 2026 •

edited

Loading