
Allow basename of dataset paths to match registered names#997

Open
AAnoosheh wants to merge 5 commits into main from aanoosheh/allow-dataset-kwargs

Conversation


@AAnoosheh (Contributor) commented Mar 6, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a small sample dataset entry (minipile_100_samples) and support for loading datasets from local filesystem paths with automatic detection and config override.
  • Chores

    • Improved local-path resolution and substring-based matching against registered dataset keys for consistent behavior.
  • Tests

    • Added a unit test to verify loading samples from a local dataset snapshot.
  • Documentation

    • Updated docs to describe local-path support, matching behavior, and updated function docstring.

@AAnoosheh requested a review from a team as a code owner March 6, 2026 17:34
@AAnoosheh requested a review from kevalmorabia97 March 6, 2026 17:34

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds local-filesystem dataset resolution: if a provided dataset identifier is a local path, derive a registered key from the path name, load the corresponding registered config and override its path to the local directory. Adds minipile_100_samples dataset config and a unit test for local-path loading.
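The resolution flow described above can be sketched as follows. This is a minimal illustration only: `resolve_dataset` and the two example config entries are stand-ins mirroring the walkthrough, not the actual modelopt implementation.

```python
import os

# Example registry entries; the real SUPPORTED_DATASET_CONFIG has more keys
# and richer per-dataset config.
SUPPORTED_DATASET_CONFIG = {
    "mmlu": {"path": "cais/mmlu"},
    "minipile_100_samples": {"path": "nanotron/minipile_100_samples"},
}


def resolve_dataset(name_or_path: str) -> dict:
    """Return a load_dataset-style config for a registered name or local path."""
    if os.path.exists(name_or_path):
        # Local path: derive a lookup key from the last path segment.
        key = os.path.basename(os.path.normpath(name_or_path))
        registered = next((k for k in SUPPORTED_DATASET_CONFIG if k in key), None)
        if registered is not None:
            config = dict(SUPPORTED_DATASET_CONFIG[registered])
            config["path"] = name_or_path  # override with the local directory
            return config
        return {"path": name_or_path}  # unregistered: fall back to auto-detection
    # Not a filesystem path: registered name or a Hub dataset id.
    return dict(SUPPORTED_DATASET_CONFIG.get(name_or_path, {"path": name_or_path}))
```

A local directory named like a registered dataset picks up that dataset's config with the `path` swapped for the local directory; anything else passes through unchanged.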

Changes

Cohort / File(s) Summary
Dataset utils
modelopt/torch/utils/dataset_utils.py
Adds os import; detects local filesystem paths via os.path.exists; derives dataset name from path last segment and performs substring matching against registered keys; if matched, loads the registered config and overrides config["path"] with the local path; documents local-path behavior in get_dataset_samples; adds minipile_100_samples entry to SUPPORTED_DATASET_CONFIG with config and preprocess.
Tests
tests/unit/torch/utils/test_dataset_utils.py
Imports updated symbols and adds test_get_dataset_samples_minipile_local_path which downloads nanotron/minipile_100_samples snapshot, calls get_dataset_samples(local_path, num_samples=5), and asserts five non-empty string samples are returned.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant DatasetUtils
  participant Filesystem
  participant Registry
  participant DatasetLoader

  Client->>DatasetUtils: get_dataset_samples(name_or_path, ...)
  DatasetUtils->>Filesystem: os.path.exists(name_or_path)?
  alt path exists
    DatasetUtils->>DatasetUtils: dataset_name = last_segment(name_or_path)
    DatasetUtils->>Registry: find registered key by substring(dataset_name)
    alt registered key found
      Registry-->>DatasetUtils: config
      DatasetUtils->>DatasetUtils: config.path = name_or_path (override)
      DatasetUtils->>DatasetLoader: load_dataset(config, split)
      DatasetLoader-->>Client: samples
    else no registered key
      DatasetUtils->>DatasetLoader: load_from_local_path(name_or_path)
      DatasetLoader-->>Client: samples
    end
  else not a path
    DatasetUtils->>Registry: resolve registered key or auto-detect
    alt resolved
      Registry-->>DatasetUtils: config
      DatasetUtils->>DatasetLoader: load_dataset(config, split)
      DatasetLoader-->>Client: samples
    else auto-detect
      DatasetUtils->>DatasetLoader: auto_detect_and_load(name_or_path)
      DatasetLoader-->>Client: samples
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately summarizes the main change, enabling dataset path basenames to match registered dataset configuration names via substring matching.
  • Security Anti-Patterns — ✅ Passed: no CRITICAL security anti-patterns from SECURITY.md found in the modified files; the code only adds dataset configuration and path-resolution logic, without unsafe deserialization or code execution.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 248-254: The code overwrites dataset_name with a manually parsed
basename which breaks unregistered local dataset paths and is not OS-portable;
instead keep dataset_path as the full local path, do not replace dataset_name,
and use os.path.basename() for any display/name extraction. Update the block
around dataset_name/dataset_path so dataset_path stays set to the full path when
os.path.exists(dataset_name) is true, and later when building config (used by
load_dataset) use dataset_path (not the truncated basename) for the "path" entry
unless the dataset is found in SUPPORTED_DATASET_CONFIG; replace any
rstrip("/").split("/")[-1] logic with os.path.basename(dataset_path) if you need
just the name for registration/display.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0e549af3-8813-487e-8da3-ccdbeb77a7b0

📥 Commits

Reviewing files that changed from the base of the PR and between 37d3f10 and dd8f7b6.

📒 Files selected for processing (1)
  • modelopt/torch/utils/dataset_utils.py

@ChenhanYu force-pushed the aanoosheh/allow-dataset-kwargs branch from 00246b1 to 10b0e68 on March 6, 2026 at 19:21
get_dataset_samples now checks if any key in SUPPORTED_DATASET_CONFIG is a
substring of dataset_name, so local paths like /hf-local/cais/mmlu correctly
match the registered 'mmlu' config without requiring an exact name match.
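The substring check described in this commit message can be illustrated with a toy resolver. The keys below are examples, not the full registered set:

```python
# Example registry keys only; the real SUPPORTED_DATASET_CONFIG is larger.
SUPPORTED_DATASET_CONFIG = {"mmlu": {}, "pile": {}}


def match_registered_key(dataset_name: str):
    """Return the first registered key that appears as a substring of the name."""
    return next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name), None)


# A local path whose segment contains a registered key now resolves to it.
assert match_registered_key("/hf-local/cais/mmlu") == "mmlu"
# Names containing no registered key still fall through to auto-detection.
assert match_registered_key("totally_unknown") is None
```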

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu force-pushed the aanoosheh/allow-dataset-kwargs branch from 10b0e68 to 595f789 on March 6, 2026 at 19:21

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 72.25%. Comparing base (42482b1) to head (c3f25fe).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/utils/dataset_utils.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #997      +/-   ##
==========================================
+ Coverage   72.13%   72.25%   +0.11%     
==========================================
  Files         209      209              
  Lines       23631    23674      +43     
==========================================
+ Hits        17046    17105      +59     
+ Misses       6585     6569      -16     

☔ View full report in Codecov by Sentry.

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)

293-302: ⚠️ Potential issue | 🟠 Major

Unregistered local datasets will fail: path contains only basename.

When a local path is provided but no registered key matches, config["path"] is set to dataset_name (line 298), which has been truncated to just the basename at line 258. This causes load_dataset() to fail because the full path is lost.

For example, passing /data/my_custom_dataset results in config = {"path": "my_custom_dataset"} instead of the full path.

Proposed fix
     else:
         warn(
             f"Dataset '{dataset_name}' is not in SUPPORTED_DATASET_CONFIG. "
             "Auto-detecting format from column names."
         )
-        config = {"path": dataset_name}
+        config = {"path": dataset_path if dataset_path is not None else dataset_name}
         splits = _normalize_splits(split) if split is not None else ["train"]

         def _preprocess(sample: dict) -> str:
             return _auto_preprocess_sample(sample, dataset_name, tokenizer)
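The failure mode this comment describes can be reproduced in a few lines. Variable names here are illustrative stand-ins for the reviewed code, not the real modelopt internals:

```python
# A user passes an unregistered local dataset directory.
dataset_name = "/data/my_custom_dataset"

# Truncation step from the reviewed code: the name is replaced by the basename.
dataset_path = dataset_name
dataset_name = dataset_path.rstrip("/").split("/")[-1]

# What the auto-detect branch built: the full path is lost, so a downstream
# load_dataset() call would fail to find the directory.
buggy_config = {"path": dataset_name}
assert buggy_config == {"path": "my_custom_dataset"}

# What the review proposes: fall back to the preserved full path.
fixed_config = {"path": dataset_path if dataset_path is not None else dataset_name}
assert fixed_config == {"path": "/data/my_custom_dataset"}
```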
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 293 - 302, The code sets
config["path"] to a truncated basename in the auto-detect branch, so local
dataset paths (e.g., /data/my_custom_dataset) lose their full path and
load_dataset() fails; fix by using the original full path variable (preserve the
input path before any basename truncation) when creating config (i.e., set
config["path"] = original_full_dataset_path or dataset_path instead of the
truncated dataset_name), leaving _preprocess (which calls
_auto_preprocess_sample) unchanged so auto-detection still works for
unregistered datasets.
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)

255-261: ⚠️ Potential issue | 🟡 Minor

Use os.path.basename() for cross-platform compatibility.

Line 258 uses dataset_path.rstrip("/").split("/")[-1] which will fail on Windows where paths use backslash separators. The os module is already imported.

Proposed fix
     if os.path.exists(dataset_name):
         # Local path
         dataset_path = dataset_name
-        dataset_name = dataset_path.rstrip("/").split("/")[-1]
+        dataset_name = os.path.basename(os.path.normpath(dataset_path))
     else:
         dataset_path = None
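A quick comparison of the two extraction styles, under the assumption that the input is a directory path with a trailing separator:

```python
import os

path = "/data/minipile_100_samples/"

# Manual parsing, as in the reviewed code: hardcodes "/" as the separator,
# so it only behaves correctly on POSIX-style paths.
manual = path.rstrip("/").split("/")[-1]

# Portable alternative: normpath drops the trailing separator using the
# platform's conventions, and basename then extracts the last segment.
portable = os.path.basename(os.path.normpath(path))
```

On a POSIX system both yield `minipile_100_samples`; the `os.path` version additionally handles backslash-separated Windows paths, which the manual split does not.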
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 255 - 261, The code that
derives the dataset folder name uses manual string splitting
(dataset_path.rstrip("/").split("/")[-1]) which is not cross-platform; replace
that logic with os.path.basename(dataset_path.rstrip(os.sep)) (or simply
os.path.basename(dataset_path) if you prefer) so dataset_name is computed
portably; update the block that sets dataset_path/dataset_name in
dataset_utils.py to call os.path.basename and remove the hardcoded "/" usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Line 262: The current selection logic uses a substring test when computing
registered_key (next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name),
None)), which can produce incorrect matches (e.g., 'pile' matching
'minipile_100_samples'); change the match to a safe strategy such as exact
equality (k == dataset_name) or, if prefix/suffix matching is required, sort
keys by length descending and check for full-token or prefix matches to prefer
the most specific key; update the code that sets registered_key and any
downstream logic that assumes that key (referencing SUPPORTED_DATASET_CONFIG,
registered_key, and dataset_name) so the correct dataset configuration is always
chosen.
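The mismatch risk called out above is easy to demonstrate: with substring matching, iteration order decides the winner, and a short key embedded in a longer one matches first. The two keys below are examples taken from the discussion:

```python
# Example registry with a short key that is a substring of a longer one.
SUPPORTED_DATASET_CONFIG = {"pile": {}, "minipile_100_samples": {}}
name = "minipile_100_samples"

# Substring matching: 'pile' is encountered first and wrongly wins.
substring_hit = next((k for k in SUPPORTED_DATASET_CONFIG if k in name), None)
assert substring_hit == "pile"  # wrong config selected

# Exact matching, as the comment suggests, removes the ambiguity.
exact_hit = next((k for k in SUPPORTED_DATASET_CONFIG if k == name), None)
assert exact_hit == "minipile_100_samples"
```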

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 86-108: The unit test test_get_dataset_samples_minipile_local_path
currently performs a network call via snapshot_download and belongs in
integration tests; change it to either move the test to tests/examples
(integration) or mock huggingface_hub.snapshot_download so the unit test is fast
and CPU-only, keeping the call to get_dataset_samples intact. Also add a new
unit test that creates a local directory whose basename does NOT match any key
in SUPPORTED_DATASET_CONFIG and calls get_dataset_samples with that local path
to exercise the auto-detection branch in dataset_utils.get_dataset_samples (the
code that handles unregistered local dataset paths), asserting it returns
expected text samples.

---

Outside diff comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 293-302: The code sets config["path"] to a truncated basename in
the auto-detect branch, so local dataset paths (e.g., /data/my_custom_dataset)
lose their full path and load_dataset() fails; fix by using the original full
path variable (preserve the input path before any basename truncation) when
creating config (i.e., set config["path"] = original_full_dataset_path or
dataset_path instead of the truncated dataset_name), leaving _preprocess (which
calls _auto_preprocess_sample) unchanged so auto-detection still works for
unregistered datasets.

---

Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 255-261: The code that derives the dataset folder name uses manual
string splitting (dataset_path.rstrip("/").split("/")[-1]) which is not
cross-platform; replace that logic with
os.path.basename(dataset_path.rstrip(os.sep)) (or simply
os.path.basename(dataset_path) if you prefer) so dataset_name is computed
portably; update the block that sets dataset_path/dataset_name in
dataset_utils.py to call os.path.basename and remove the hardcoded "/" usage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7bc85fe-223b-4370-833c-d061a83ee250

📥 Commits

Reviewing files that changed from the base of the PR and between 595f789 and ed58dde.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Comment on lines +86 to +108
def test_get_dataset_samples_minipile_local_path(tmp_path):
    """Test that get_dataset_samples resolves a local path for minipile_100_samples.

    Downloads nanotron/minipile_100_samples via snapshot_download and passes the
    local directory path to get_dataset_samples. This exercises:
    - local path detection (os.path.exists branch)
    - substring matching of 'minipile_100_samples' in SUPPORTED_DATASET_CONFIG
    - the 'text' column preprocessor for the registered config
    """
    from huggingface_hub import snapshot_download

    local_dir = tmp_path / "nanotron" / "minipile_100_samples"
    snapshot_download(
        repo_id="nanotron/minipile_100_samples",
        repo_type="dataset",
        local_dir=str(local_dir),
    )

    samples = get_dataset_samples(str(local_dir), num_samples=5)

    assert isinstance(samples, list)
    assert len(samples) == 5
    assert all(isinstance(s, str) and len(s) > 0 for s in samples)
coderabbitai bot commented Mar 6, 2026
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are existing patterns for marking slow/network tests
rg -n "pytest\.mark\.(slow|network|integration)" tests/ || echo "No existing slow/network markers found"

Repository: NVIDIA/Model-Optimizer

Length of output: 104


🏁 Script executed:

# First, let's look at the full test file to understand the context
wc -l tests/unit/torch/utils/test_dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 114


🏁 Script executed:

# Check test file structure and see all tests
cat -n tests/unit/torch/utils/test_dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5341


🏁 Script executed:

# Check if there are integration test directories as mentioned in guidelines
ls -la tests/ | grep -E "^d" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 524


🏁 Script executed:

# Search for any existing tests that use snapshot_download or HF hub
rg -n "snapshot_download|load_dataset" tests/

Repository: NVIDIA/Model-Optimizer

Length of output: 622


🏁 Script executed:

# Search for any test mentioning "unregistered" or testing unknown datasets
rg -n "unregistered|unknown.*dataset|test.*dataset.*path" tests/unit/torch/utils/test_dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 128


🏁 Script executed:

# Look at the get_dataset_samples implementation to understand what it does
fd -n "dataset_utils.py" src/ --exec cat -n {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 296


🏁 Script executed:

# Search more broadly for dataset_utils implementation
fd -n "dataset_utils.py" --type f

Repository: NVIDIA/Model-Optimizer

Length of output: 296


🏁 Script executed:

# Find dataset_utils.py correctly
fd "dataset_utils\.py" --type f

Repository: NVIDIA/Model-Optimizer

Length of output: 286


🏁 Script executed:

# Once we find it, look at the implementation
find . -name "dataset_utils.py" -type f | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 105


🏁 Script executed:

# Check the size and read the dataset_utils.py implementation
wc -l modelopt/torch/utils/dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 107


🏁 Script executed:

# Read the get_dataset_samples function
cat -n modelopt/torch/utils/dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 30590


🏁 Script executed:

# Search for any existing tests of unregistered datasets
rg -A 10 "unregistered|SUPPORTED_DATASET_CONFIG" tests/unit/torch/utils/test_dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 474


Move network-dependent test to integration tests or mock external dependencies.

This test downloads from HuggingFace Hub via snapshot_download, requiring network connectivity and violating the coding guideline that tests/unit/**/*.py must be fast CPU-based tests. The test belongs in tests/examples (integration tests) or should mock snapshot_download to avoid network calls.

Additionally, add a test for unregistered local dataset paths (where the basename does not match any registered key in SUPPORTED_DATASET_CONFIG). Currently, only the registered minipile_100_samples path is tested, leaving the auto-detection code path (lines 294–302 of dataset_utils.py) untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 86 - 108, The unit
test test_get_dataset_samples_minipile_local_path currently performs a network
call via snapshot_download and belongs in integration tests; change it to either
move the test to tests/examples (integration) or mock
huggingface_hub.snapshot_download so the unit test is fast and CPU-only, keeping
the call to get_dataset_samples intact. Also add a new unit test that creates a
local directory whose basename does NOT match any key in
SUPPORTED_DATASET_CONFIG and calls get_dataset_samples with that local path to
exercise the auto-detection branch in dataset_utils.get_dataset_samples (the
code that handles unregistered local dataset paths), asserting it returns
expected text samples.

Collaborator left a comment

@AAnoosheh, this test likely needs to be moved to gpu_test; I think the basic torch test environment does not have datasets installed.

coderabbitai bot left a comment

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

- Add minipile_100_samples to SUPPORTED_DATASET_CONFIG with text column preprocessor
- Add unit test covering local path + substring matching via snapshot_download

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu force-pushed the aanoosheh/allow-dataset-kwargs branch from ed58dde to d8b2d4e on March 6, 2026 at 20:52
coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/unit/torch/utils/test_dataset_utils.py (1)

86-108: ⚠️ Potential issue | 🟠 Major

This unit test is still network-bound and won't catch the resolver bug.

snapshot_download() makes tests/unit/... depend on the Hub, and the assertions only check for five non-empty strings. Because both the pile and minipile_100_samples configs currently preprocess sample["text"], this can still pass when the resolver selects the wrong key. Please stub datasets.load_dataset/Hub calls and assert the resolved kwargs instead (for example, that no inherited name="v1.0" is passed), then add a second case whose directory basename is not in SUPPORTED_DATASET_CONFIG so the unregistered local-path branch is exercised too.

Suggested direction
-def test_get_dataset_samples_minipile_local_path(tmp_path):
+def test_get_dataset_samples_minipile_local_path(monkeypatch, tmp_path):
-    from huggingface_hub import snapshot_download
-
     local_dir = tmp_path / "nanotron" / "minipile_100_samples"
-    snapshot_download(
-        repo_id="nanotron/minipile_100_samples",
-        repo_type="dataset",
-        local_dir=str(local_dir),
-    )
+    local_dir.mkdir(parents=True)
+
+    captured: dict[str, object] = {}
+
+    def fake_load_dataset(*, split, streaming, **kwargs):
+        captured.update({"split": split, "streaming": streaming, **kwargs})
+        return iter([{"text": f"sample-{i}"} for i in range(5)])
+
+    import datasets
+    monkeypatch.setattr(datasets, "load_dataset", fake_load_dataset)

     samples = get_dataset_samples(str(local_dir), num_samples=5)

-    assert isinstance(samples, list)
-    assert len(samples) == 5
-    assert all(isinstance(s, str) and len(s) > 0 for s in samples)
+    assert samples == [f"sample-{i}" for i in range(5)]
+    assert captured["path"] == str(local_dir)
+    assert "name" not in captured

As per coding guidelines, tests/unit/**/*.py: Unit tests in tests/unit should be fast CPU-based tests taking no more than a few seconds to run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 86 - 108, The test
currently performs network I/O via snapshot_download and only asserts result
strings; instead, mock/stub external Hub calls and datasets.load_dataset so the
test runs offline and can assert how get_dataset_samples resolves kwargs: patch
huggingface_hub.snapshot_download and datasets.load_dataset (or the internal
call site used by get_dataset_samples) to return a fake local_dir and a fake
dataset object, then call get_dataset_samples with that local path and assert
the resolved kwargs passed into datasets.load_dataset (e.g., that no inherited
name="v1.0" or incorrect config is forwarded and that the selected config equals
the one matched from SUPPORTED_DATASET_CONFIG like "minipile_100_samples"); also
add a second case where the directory basename is not present in
SUPPORTED_DATASET_CONFIG to exercise the unregistered-local-path branch and
assert datasets.load_dataset is called with the generic/local-path kwargs
instead.
modelopt/torch/utils/dataset_utils.py (1)

255-262: ⚠️ Potential issue | 🟠 Major

Keep the full local path separate from the lookup key.

This block truncates a local path to rstrip("/").split("/")[-1] and then resolves configs with substring matching. That breaks unregistered local datasets later on Lines 294-298 because load_dataset() receives only the basename, and it also makes minipile_100_samples resolve to the earlier pile entry in the current config order. Preserve dataset_path as-is, derive a normalized basename for lookup, and match that key exactly.

Suggested fix
-    if os.path.exists(dataset_name):
-        # Local path
-        dataset_path = dataset_name
-        dataset_name = dataset_path.rstrip("/").split("/")[-1]
-    else:
-        dataset_path = None
-
-    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name), None)
+    dataset_path: str | None = None
+    dataset_key = dataset_name
+    if os.path.exists(dataset_name):
+        dataset_path = dataset_name
+        dataset_key = os.path.basename(os.path.normpath(dataset_path))
+
+    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k == dataset_key), None)
@@
-            f"Dataset '{dataset_name}' is not in SUPPORTED_DATASET_CONFIG. "
+            f"Dataset '{dataset_key}' is not in SUPPORTED_DATASET_CONFIG. "
             "Auto-detecting format from column names."
         )
-        config = {"path": dataset_name}
+        config = {"path": dataset_path or dataset_name}
         splits = _normalize_splits(split) if split is not None else ["train"]

         def _preprocess(sample: dict) -> str:
-            return _auto_preprocess_sample(sample, dataset_name, tokenizer)
+            return _auto_preprocess_sample(sample, dataset_key, tokenizer)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 255 - 262, The current
code mutates dataset_name to the basename which loses the full local path and
uses substring matching against SUPPORTED_DATASET_CONFIG; revert this by keeping
dataset_path set to the original full path (do not overwrite dataset_name),
derive a separate normalized basename (e.g., use
os.path.basename(dataset_name.rstrip("/"))) for lookup, and compute
registered_key by exact equality against entries in SUPPORTED_DATASET_CONFIG
using that basename (not substring containment); ensure later calls such as
load_dataset() are passed dataset_path when it exists (the full local path) and
only use registered_key/basename for config lookup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 231-234: The docstring for the dataset_name parameter overstates
supported inputs (mentions .jsonl.gz and uses "mmlu" as an example) while the
implementation only special-cases .jsonl and SUPPORTED_DATASET_CONFIG does not
contain "mmlu"; update the docstring to accurately reflect current behavior by
removing .jsonl.gz from the supported list and replacing the "mmlu" example with
an actual registered key from SUPPORTED_DATASET_CONFIG (or a generic phrase like
"one of the keys in SUPPORTED_DATASET_CONFIG"); ensure you update the docstring
near the dataset_name parameter and reference SUPPORTED_DATASET_CONFIG so docs
and code match.

---

Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 255-262: The current code mutates dataset_name to the basename
which loses the full local path and uses substring matching against
SUPPORTED_DATASET_CONFIG; revert this by keeping dataset_path set to the
original full path (do not overwrite dataset_name), derive a separate normalized
basename (e.g., use os.path.basename(dataset_name.rstrip("/"))) for lookup, and
compute registered_key by exact equality against entries in
SUPPORTED_DATASET_CONFIG using that basename (not substring containment); ensure
later calls such as load_dataset() are passed dataset_path when it exists (the
full local path) and only use registered_key/basename for config lookup.

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 86-108: The test currently performs network I/O via
snapshot_download and only asserts result strings; instead, mock/stub external
Hub calls and datasets.load_dataset so the test runs offline and can assert how
get_dataset_samples resolves kwargs: patch huggingface_hub.snapshot_download and
datasets.load_dataset (or the internal call site used by get_dataset_samples) to
return a fake local_dir and a fake dataset object, then call get_dataset_samples
with that local path and assert the resolved kwargs passed into
datasets.load_dataset (e.g., that no inherited name="v1.0" or incorrect config
is forwarded and that the selected config equals the one matched from
SUPPORTED_DATASET_CONFIG like "minipile_100_samples"); also add a second case
where the directory basename is not present in SUPPORTED_DATASET_CONFIG to
exercise the unregistered-local-path branch and assert datasets.load_dataset is
called with the generic/local-path kwargs instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b3c8d7f2-c997-4ae2-ae2e-3bfadaf8d90d

📥 Commits

Reviewing files that changed from the base of the PR and between ed58dde and d8b2d4e.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu force-pushed the aanoosheh/allow-dataset-kwargs branch from 01c2508 to 63f9c7d on March 6, 2026 at 23:32
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
@AAnoosheh enabled auto-merge (squash) March 6, 2026 23:54
3 participants