Allow basename of dataset paths to match registered names #997
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds local-filesystem dataset resolution: if a provided dataset identifier is a local path, derive a registered key from the path name, load the corresponding registered config, and override its `path` with the local path.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DatasetUtils
    participant Filesystem
    participant Registry
    participant DatasetLoader
    Client->>DatasetUtils: get_dataset_samples(name_or_path, ...)
    DatasetUtils->>Filesystem: os.path.exists(name_or_path)?
    alt path exists
        DatasetUtils->>DatasetUtils: dataset_name = last_segment(name_or_path)
        DatasetUtils->>Registry: find registered key by substring(dataset_name)
        alt registered key found
            Registry-->>DatasetUtils: config
            DatasetUtils->>DatasetUtils: config.path = name_or_path (override)
            DatasetUtils->>DatasetLoader: load_dataset(config, split)
            DatasetLoader-->>Client: samples
        else no registered key
            DatasetUtils->>DatasetLoader: load_from_local_path(name_or_path)
            DatasetLoader-->>Client: samples
        end
    else not a path
        DatasetUtils->>Registry: resolve registered key or auto-detect
        alt resolved
            Registry-->>DatasetUtils: config
            DatasetUtils->>DatasetLoader: load_dataset(config, split)
            DatasetLoader-->>Client: samples
        else auto-detect
            DatasetUtils->>DatasetLoader: auto_detect_and_load(name_or_path)
            DatasetLoader-->>Client: samples
        end
    end
```
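The flow in the diagram can be sketched in plain Python. The registry below is a made-up stand-in for `SUPPORTED_DATASET_CONFIG`, and the substring key lookup mirrors what the diagram describes, not necessarily the real implementation:

```python
import os

# Toy stand-in for SUPPORTED_DATASET_CONFIG; keys and configs are illustrative.
REGISTRY = {
    "mmlu": {"path": "cais/mmlu", "preprocess": "question"},
    "pile": {"path": "monology/pile", "preprocess": "text"},
}

def resolve(name_or_path: str):
    """Return (config, branch) following the sequence diagram above."""
    if os.path.exists(name_or_path):
        basename = os.path.basename(os.path.normpath(name_or_path))
        # Substring match, as described in the diagram.
        key = next((k for k in REGISTRY if k in basename), None)
        if key is not None:
            config = dict(REGISTRY[key])
            config["path"] = name_or_path  # override with the local path
            return config, "registered-local"
        return {"path": name_or_path}, "auto-detect-local"
    key = next((k for k in REGISTRY if k == name_or_path), None)
    if key is not None:
        return dict(REGISTRY[key]), "registered"
    return {"path": name_or_path}, "auto-detect"
```

With this sketch, a local directory named `mmlu` resolves to the registered `mmlu` config with its `path` rewritten to the local directory, while an unrecognized directory falls through to the auto-detect branch.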
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 248-254: The code overwrites dataset_name with a manually parsed
basename which breaks unregistered local dataset paths and is not OS-portable;
instead keep dataset_path as the full local path, do not replace dataset_name,
and use os.path.basename() for any display/name extraction. Update the block
around dataset_name/dataset_path so dataset_path stays set to the full path when
os.path.exists(dataset_name) is true, and later when building config (used by
load_dataset) use dataset_path (not the truncated basename) for the "path" entry
unless the dataset is found in SUPPORTED_DATASET_CONFIG; replace any
rstrip("/").split("/")[-1] logic with os.path.basename(dataset_path) if you need
just the name for registration/display.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0e549af3-8813-487e-8da3-ccdbeb77a7b0
📒 Files selected for processing (1)
modelopt/torch/utils/dataset_utils.py
Force-pushed from 00246b1 to 10b0e68
get_dataset_samples now checks if any key in SUPPORTED_DATASET_CONFIG is a substring of dataset_name, so local paths like /hf-local/cais/mmlu correctly match the registered 'mmlu' config without requiring an exact name match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed from 10b0e68 to 595f789
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #997      +/-   ##
==========================================
+ Coverage   72.13%   72.25%   +0.11%
==========================================
  Files         209      209
  Lines       23631    23674      +43
==========================================
+ Hits        17046    17105      +59
+ Misses       6585     6569      -16
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)
293-302: ⚠️ Potential issue | 🟠 Major — Unregistered local datasets will fail: path contains only basename.

When a local path is provided but no registered key matches, `config["path"]` is set to `dataset_name` (line 298), which has been truncated to just the basename at line 258. This causes `load_dataset()` to fail because the full path is lost. For example, passing `/data/my_custom_dataset` results in `config = {"path": "my_custom_dataset"}` instead of the full path.

Proposed fix
```diff
     else:
         warn(
             f"Dataset '{dataset_name}' is not in SUPPORTED_DATASET_CONFIG. "
             "Auto-detecting format from column names."
         )
-        config = {"path": dataset_name}
+        config = {"path": dataset_path if dataset_path is not None else dataset_name}
         splits = _normalize_splits(split) if split is not None else ["train"]

         def _preprocess(sample: dict) -> str:
             return _auto_preprocess_sample(sample, dataset_name, tokenizer)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 293 - 302, The code sets config["path"] to a truncated basename in the auto-detect branch, so local dataset paths (e.g., /data/my_custom_dataset) lose their full path and load_dataset() fails; fix by using the original full path variable (preserve the input path before any basename truncation) when creating config (i.e., set config["path"] = original_full_dataset_path or dataset_path instead of the truncated dataset_name), leaving _preprocess (which calls _auto_preprocess_sample) unchanged so auto-detection still works for unregistered datasets.
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)
255-261: ⚠️ Potential issue | 🟡 Minor — Use `os.path.basename()` for cross-platform compatibility.

Line 258 uses `dataset_path.rstrip("/").split("/")[-1]`, which will fail on Windows where paths use backslash separators. The `os` module is already imported.

Proposed fix

```diff
 if os.path.exists(dataset_name):
     # Local path
     dataset_path = dataset_name
-    dataset_name = dataset_path.rstrip("/").split("/")[-1]
+    dataset_name = os.path.basename(os.path.normpath(dataset_path))
 else:
     dataset_path = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 255 - 261, The code that derives the dataset folder name uses manual string splitting (dataset_path.rstrip("/").split("/")[-1]) which is not cross-platform; replace that logic with os.path.basename(dataset_path.rstrip(os.sep)) (or simply os.path.basename(dataset_path) if you prefer) so dataset_name is computed portably; update the block that sets dataset_path/dataset_name in dataset_utils.py to call os.path.basename and remove the hardcoded "/" usage.
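The portability point is easy to demonstrate; the path below is made up:

```python
import os

# A trailing separator makes plain basename() return an empty string,
# so normalize the path first.
p = "/hf-local/cais/mmlu/"
assert os.path.basename(p) == ""
assert os.path.basename(os.path.normpath(p)) == "mmlu"

# The manual split happens to work on POSIX but hard-codes "/" as the
# separator; on Windows a path like r"C:\data\mmlu" contains no "/" at
# all, so the same expression would return the entire string.
assert p.rstrip("/").split("/")[-1] == "mmlu"
```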
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Line 262: The current selection logic uses a substring test when computing
registered_key (next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name),
None)), which can produce incorrect matches (e.g., 'pile' matching
'minipile_100_samples'); change the match to a safe strategy such as exact
equality (k == dataset_name) or, if prefix/suffix matching is required, sort
keys by length descending and check for full-token or prefix matches to prefer
the most specific key; update the code that sets registered_key and any
downstream logic that assumes that key (referencing SUPPORTED_DATASET_CONFIG,
registered_key, and dataset_name) so the correct dataset configuration is always
chosen.
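The false match this comment describes is easy to reproduce; the two-key registry below is a reduced stand-in for `SUPPORTED_DATASET_CONFIG`:

```python
SUPPORTED = {"pile": {}, "minipile_100_samples": {}}  # reduced stand-in

name = "minipile_100_samples"

# Substring containment returns the first key that happens to be contained:
loose = next((k for k in SUPPORTED if k in name), None)
assert loose == "pile"  # 'pile' is a substring of 'minipile_100_samples'

# Exact equality picks the intended key:
strict = next((k for k in SUPPORTED if k == name), None)
assert strict == "minipile_100_samples"
```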
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 86-108: The unit test test_get_dataset_samples_minipile_local_path
currently performs a network call via snapshot_download and belongs in
integration tests; change it to either move the test to tests/examples
(integration) or mock huggingface_hub.snapshot_download so the unit test is fast
and CPU-only, keeping the call to get_dataset_samples intact. Also add a new
unit test that creates a local directory whose basename does NOT match any key
in SUPPORTED_DATASET_CONFIG and calls get_dataset_samples with that local path
to exercise the auto-detection branch in dataset_utils.get_dataset_samples (the
code that handles unregistered local dataset paths), asserting it returns
expected text samples.
---
Outside diff comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 293-302: The code sets config["path"] to a truncated basename in
the auto-detect branch, so local dataset paths (e.g., /data/my_custom_dataset)
lose their full path and load_dataset() fails; fix by using the original full
path variable (preserve the input path before any basename truncation) when
creating config (i.e., set config["path"] = original_full_dataset_path or
dataset_path instead of the truncated dataset_name), leaving _preprocess (which
calls _auto_preprocess_sample) unchanged so auto-detection still works for
unregistered datasets.
---
Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 255-261: The code that derives the dataset folder name uses manual
string splitting (dataset_path.rstrip("/").split("/")[-1]) which is not
cross-platform; replace that logic with
os.path.basename(dataset_path.rstrip(os.sep)) (or simply
os.path.basename(dataset_path) if you prefer) so dataset_name is computed
portably; update the block that sets dataset_path/dataset_name in
dataset_utils.py to call os.path.basename and remove the hardcoded "/" usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a7bc85fe-223b-4370-833c-d061a83ee250
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py
```python
def test_get_dataset_samples_minipile_local_path(tmp_path):
    """Test that get_dataset_samples resolves a local path for minipile_100_samples.

    Downloads nanotron/minipile_100_samples via snapshot_download and passes the
    local directory path to get_dataset_samples. This exercises:
    - local path detection (os.path.exists branch)
    - substring matching of 'minipile_100_samples' in SUPPORTED_DATASET_CONFIG
    - the 'text' column preprocessor for the registered config
    """
    from huggingface_hub import snapshot_download

    local_dir = tmp_path / "nanotron" / "minipile_100_samples"
    snapshot_download(
        repo_id="nanotron/minipile_100_samples",
        repo_type="dataset",
        local_dir=str(local_dir),
    )

    samples = get_dataset_samples(str(local_dir), num_samples=5)

    assert isinstance(samples, list)
    assert len(samples) == 5
    assert all(isinstance(s, str) and len(s) > 0 for s in samples)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check if there are existing patterns for marking slow/network tests
rg -n "pytest\.mark\.(slow|network|integration)" tests/ || echo "No existing slow/network markers found"
```

🏁 Script executed:

```shell
# First, let's look at the full test file to understand the context
wc -l tests/unit/torch/utils/test_dataset_utils.py
```

🏁 Script executed:

```shell
# Check test file structure and see all tests
cat -n tests/unit/torch/utils/test_dataset_utils.py
```

🏁 Script executed:

```shell
# Check if there are integration test directories as mentioned in guidelines
ls -la tests/ | grep -E "^d" | head -20
```

🏁 Script executed:

```shell
# Search for any existing tests that use snapshot_download or HF hub
rg -n "snapshot_download|load_dataset" tests/
```

🏁 Script executed:

```shell
# Search for any test mentioning "unregistered" or testing unknown datasets
rg -n "unregistered|unknown.*dataset|test.*dataset.*path" tests/unit/torch/utils/test_dataset_utils.py
```

🏁 Script executed:

```shell
# Look at the get_dataset_samples implementation to understand what it does
fd -n "dataset_utils.py" src/ --exec cat -n {} \;
```

🏁 Script executed:

```shell
# Search more broadly for dataset_utils implementation
fd -n "dataset_utils.py" --type f
```

🏁 Script executed:

```shell
# Find dataset_utils.py correctly
fd "dataset_utils\.py" --type f
```

🏁 Script executed:

```shell
# Once we find it, look at the implementation
find . -name "dataset_utils.py" -type f | head -5
```

🏁 Script executed:

```shell
# Check the size and read the dataset_utils.py implementation
wc -l modelopt/torch/utils/dataset_utils.py
```

🏁 Script executed:

```shell
# Read the get_dataset_samples function
cat -n modelopt/torch/utils/dataset_utils.py
```

🏁 Script executed:

```shell
# Search for any existing tests of unregistered datasets
rg -A 10 "unregistered|SUPPORTED_DATASET_CONFIG" tests/unit/torch/utils/test_dataset_utils.py
```
Move network-dependent test to integration tests or mock external dependencies.
This test downloads from HuggingFace Hub via snapshot_download, requiring network connectivity and violating the coding guideline that tests/unit/**/*.py must be fast CPU-based tests. The test belongs in tests/examples (integration tests) or should mock snapshot_download to avoid network calls.
Additionally, add a test for unregistered local dataset paths (where the basename does not match any registered key in SUPPORTED_DATASET_CONFIG). Currently, only the registered minipile_100_samples path is tested, leaving the auto-detection code path (lines 294–302 of dataset_utils.py) untested.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 86 - 108, The unit
test test_get_dataset_samples_minipile_local_path currently performs a network
call via snapshot_download and belongs in integration tests; change it to either
move the test to tests/examples (integration) or mock
huggingface_hub.snapshot_download so the unit test is fast and CPU-only, keeping
the call to get_dataset_samples intact. Also add a new unit test that creates a
local directory whose basename does NOT match any key in
SUPPORTED_DATASET_CONFIG and calls get_dataset_samples with that local path to
exercise the auto-detection branch in dataset_utils.get_dataset_samples (the
code that handles unregistered local dataset paths), asserting it returns
expected text samples.
@AAnoosheh, so this test likely needs to be moved to gpu_test. I think the basic torch test does not have `datasets` installed.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
- Add minipile_100_samples to SUPPORTED_DATASET_CONFIG with text column preprocessor - Add unit test covering local path + substring matching via snapshot_download Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed from ed58dde to d8b2d4e
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unit/torch/utils/test_dataset_utils.py (1)
86-108: ⚠️ Potential issue | 🟠 Major — This unit test is still network-bound and won't catch the resolver bug.

`snapshot_download()` makes `tests/unit/...` depend on the Hub, and the assertions only check for five non-empty strings. Because both the `pile` and `minipile_100_samples` configs currently preprocess `sample["text"]`, this can still pass when the resolver selects the wrong key. Please stub `datasets.load_dataset`/Hub calls and assert the resolved kwargs instead (for example, that no inherited `name="v1.0"` is passed), then add a second case whose directory basename is not in `SUPPORTED_DATASET_CONFIG` so the unregistered local-path branch is exercised too.

Suggested direction

```diff
-def test_get_dataset_samples_minipile_local_path(tmp_path):
+def test_get_dataset_samples_minipile_local_path(monkeypatch, tmp_path):
-    from huggingface_hub import snapshot_download
     local_dir = tmp_path / "nanotron" / "minipile_100_samples"
-    snapshot_download(
-        repo_id="nanotron/minipile_100_samples",
-        repo_type="dataset",
-        local_dir=str(local_dir),
-    )
+    local_dir.mkdir(parents=True)
+
+    captured: dict[str, object] = {}
+
+    def fake_load_dataset(*, split, streaming, **kwargs):
+        captured.update({"split": split, "streaming": streaming, **kwargs})
+        return iter([{"text": f"sample-{i}"} for i in range(5)])
+
+    import datasets
+    monkeypatch.setattr(datasets, "load_dataset", fake_load_dataset)

     samples = get_dataset_samples(str(local_dir), num_samples=5)

-    assert isinstance(samples, list)
-    assert len(samples) == 5
-    assert all(isinstance(s, str) and len(s) > 0 for s in samples)
+    assert samples == [f"sample-{i}" for i in range(5)]
+    assert captured["path"] == str(local_dir)
+    assert "name" not in captured
```

As per coding guidelines, `tests/unit/**/*.py`: Unit tests in tests/unit should be fast CPU-based tests taking no more than a few seconds to run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 86 - 108, The test currently performs network I/O via snapshot_download and only asserts result strings; instead, mock/stub external Hub calls and datasets.load_dataset so the test runs offline and can assert how get_dataset_samples resolves kwargs: patch huggingface_hub.snapshot_download and datasets.load_dataset (or the internal call site used by get_dataset_samples) to return a fake local_dir and a fake dataset object, then call get_dataset_samples with that local path and assert the resolved kwargs passed into datasets.load_dataset (e.g., that no inherited name="v1.0" or incorrect config is forwarded and that the selected config equals the one matched from SUPPORTED_DATASET_CONFIG like "minipile_100_samples"); also add a second case where the directory basename is not present in SUPPORTED_DATASET_CONFIG to exercise the unregistered-local-path branch and assert datasets.load_dataset is called with the generic/local-path kwargs instead.

modelopt/torch/utils/dataset_utils.py (1)
255-262: ⚠️ Potential issue | 🟠 Major — Keep the full local path separate from the lookup key.

This block truncates a local path to `rstrip("/").split("/")[-1]` and then resolves configs with substring matching. That breaks unregistered local datasets later on lines 294-298 because `load_dataset()` receives only the basename, and it also makes `minipile_100_samples` resolve to the earlier `pile` entry in the current config order. Preserve `dataset_path` as-is, derive a normalized basename for lookup, and match that key exactly.

Suggested fix
```diff
-    if os.path.exists(dataset_name):
-        # Local path
-        dataset_path = dataset_name
-        dataset_name = dataset_path.rstrip("/").split("/")[-1]
-    else:
-        dataset_path = None
-
-    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name), None)
+    dataset_path: str | None = None
+    dataset_key = dataset_name
+    if os.path.exists(dataset_name):
+        dataset_path = dataset_name
+        dataset_key = os.path.basename(os.path.normpath(dataset_path))
+
+    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k == dataset_key), None)
@@
-            f"Dataset '{dataset_name}' is not in SUPPORTED_DATASET_CONFIG. "
+            f"Dataset '{dataset_key}' is not in SUPPORTED_DATASET_CONFIG. "
             "Auto-detecting format from column names."
         )
-        config = {"path": dataset_name}
+        config = {"path": dataset_path or dataset_name}
         splits = _normalize_splits(split) if split is not None else ["train"]

         def _preprocess(sample: dict) -> str:
-            return _auto_preprocess_sample(sample, dataset_name, tokenizer)
+            return _auto_preprocess_sample(sample, dataset_key, tokenizer)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 255 - 262, The current code mutates dataset_name to the basename which loses the full local path and uses substring matching against SUPPORTED_DATASET_CONFIG; revert this by keeping dataset_path set to the original full path (do not overwrite dataset_name), derive a separate normalized basename (e.g., use os.path.basename(dataset_name.rstrip("/"))) for lookup, and compute registered_key by exact equality against entries in SUPPORTED_DATASET_CONFIG using that basename (not substring containment); ensure later calls such as load_dataset() are passed dataset_path when it exists (the full local path) and only use registered_key/basename for config lookup.
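Pulling the two fixes together (exact-match lookup plus preserving the full path), the intended resolution behaves like the small self-contained sketch below. The registry contents are illustrative stand-ins, not the repository's real `SUPPORTED_DATASET_CONFIG`:

```python
import os

# Illustrative stand-in for the real registry.
SUPPORTED_DATASET_CONFIG = {
    "pile": {"path": "monology/pile"},
    "minipile_100_samples": {"path": "nanotron/minipile_100_samples"},
}

def resolve_config(dataset_name: str) -> dict:
    """Exact-match the basename; keep the full local path for loading."""
    dataset_path = None
    dataset_key = dataset_name
    if os.path.exists(dataset_name):
        dataset_path = dataset_name
        dataset_key = os.path.basename(os.path.normpath(dataset_path))
    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k == dataset_key), None)
    if registered_key is not None:
        config = dict(SUPPORTED_DATASET_CONFIG[registered_key])
        if dataset_path is not None:
            config["path"] = dataset_path  # override with the local path
        return config
    # Unregistered: fall back to the full path so load_dataset() can find it.
    return {"path": dataset_path or dataset_name}
```

A local directory named `minipile_100_samples` now resolves to its registered config with the local path substituted in, while an unregistered directory keeps its full path instead of being truncated to a basename.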
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 231-234: The docstring for the dataset_name parameter overstates
supported inputs (mentions .jsonl.gz and uses "mmlu" as an example) while the
implementation only special-cases .jsonl and SUPPORTED_DATASET_CONFIG does not
contain "mmlu"; update the docstring to accurately reflect current behavior by
removing .jsonl.gz from the supported list and replacing the "mmlu" example with
an actual registered key from SUPPORTED_DATASET_CONFIG (or a generic phrase like
"one of the keys in SUPPORTED_DATASET_CONFIG"); ensure you update the docstring
near the dataset_name parameter and reference SUPPORTED_DATASET_CONFIG so docs
and code match.
---
Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 255-262: The current code mutates dataset_name to the basename
which loses the full local path and uses substring matching against
SUPPORTED_DATASET_CONFIG; revert this by keeping dataset_path set to the
original full path (do not overwrite dataset_name), derive a separate normalized
basename (e.g., use os.path.basename(dataset_name.rstrip("/"))) for lookup, and
compute registered_key by exact equality against entries in
SUPPORTED_DATASET_CONFIG using that basename (not substring containment); ensure
later calls such as load_dataset() are passed dataset_path when it exists (the
full local path) and only use registered_key/basename for config lookup.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 86-108: The test currently performs network I/O via
snapshot_download and only asserts result strings; instead, mock/stub external
Hub calls and datasets.load_dataset so the test runs offline and can assert how
get_dataset_samples resolves kwargs: patch huggingface_hub.snapshot_download and
datasets.load_dataset (or the internal call site used by get_dataset_samples) to
return a fake local_dir and a fake dataset object, then call get_dataset_samples
with that local path and assert the resolved kwargs passed into
datasets.load_dataset (e.g., that no inherited name="v1.0" or incorrect config
is forwarded and that the selected config equals the one matched from
SUPPORTED_DATASET_CONFIG like "minipile_100_samples"); also add a second case
where the directory basename is not present in SUPPORTED_DATASET_CONFIG to
exercise the unregistered-local-path branch and assert datasets.load_dataset is
called with the generic/local-path kwargs instead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b3c8d7f2-c997-4ae2-ae2e-3bfadaf8d90d
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed from 01c2508 to 63f9c7d
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
What does this PR do?
Type of change: ?
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary by CodeRabbit
New Features
Chores
Tests
Documentation