Conversation
Integrate the CBT-Bench psychotherapy benchmark dataset from HuggingFace (Psychotherapy-LLM/CBT-Bench) into PyRIT.
@microsoft-github-policy-service agree company="Microsoft"
Pull request overview
This pull request integrates the CBT-Bench (Cognitive Behavioral Therapy benchmark) dataset from HuggingFace into PyRIT, enabling evaluation of LLM safety and alignment in psychotherapy contexts. The implementation supersedes stale PR #888 and addresses issue #865.
Changes:
- Added a new remote dataset loader for CBT-Bench with support for 39 HuggingFace subsets
- Registered the new loader in the remote datasets init file
- Added comprehensive unit tests with 7 test cases covering initialization, fetching, edge cases, and metadata validation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `pyrit/datasets/seed_datasets/remote/cbt_bench_dataset.py` | New dataset loader class implementing fetch logic, combining situation and thoughts into the prompt value, storing core beliefs in metadata |
| `pyrit/datasets/seed_datasets/remote/__init__.py` | Registered `_CBTBenchDataset` in imports and the `__all__` list, following alphabetical ordering |
| `tests/unit/datasets/test_cbt_bench_dataset.py` | Unit tests with fixtures and mocking covering normal operation, custom configs, edge cases, and metadata validation |
Comments suppressed due to low confidence (1)
pyrit/datasets/seed_datasets/remote/cbt_bench_dataset.py:117
- The metadata field for core_belief_fine_grained is being set to a list, but the Seed.metadata field is typed as dict[str, Union[str, int]]. This creates a type mismatch. To fix this, either convert the list to a string (e.g., JSON string or comma-separated) before storing in metadata, or use a local variable annotation like other datasets do (from typing import Any; metadata: dict[str, Any] = {...}).
metadata["core_belief_fine_grained"] = core_beliefs
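One way to resolve the flagged type mismatch is to serialize the list before storing it. A minimal sketch, assuming `core_beliefs` is a list of strings parsed from a CBT-Bench item (the sample values and the `config` key here are illustrative, not taken from the dataset):

```python
import json

# Illustrative values; in the loader, core_beliefs would come from the
# CBT-Bench item being processed.
core_beliefs = ["I am helpless", "I am unlovable"]

# Seed.metadata is typed dict[str, Union[str, int]], so store a JSON
# string rather than the raw list.
metadata: dict = {"config": "core_fine_seed"}
metadata["core_belief_fine_grained"] = json.dumps(core_beliefs)

# Downstream consumers can recover the original list with json.loads.
assert json.loads(metadata["core_belief_fine_grained"]) == core_beliefs
```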
Oh and we need to rerun the notebook that lists all the datasets. It's the one in doc/code/datasets/ with index 0 I think. It should add CBT bench as a new line after execution.
@romanlutz Thank you for the feedback! I've addressed both items.
Ready for review!
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (3)
tests/unit/datasets/test_cbt_bench_dataset.py:116
- The loader has a distinct branch where `thoughts` is present but `situation` is empty (it sets `value = thoughts`). There's currently no unit test covering that path, while the other branches are covered (both fields, situation-only, neither). Add a test case with `situation=""` and non-empty `thoughts` to ensure this behavior doesn't regress.
```python
async def test_fetch_dataset_situation_only(self, mock_cbt_bench_data_missing_thoughts):
    """Test that items with only situation (no thoughts) still work."""
    loader = _CBTBenchDataset()
    with patch.object(loader, "_fetch_from_huggingface", return_value=mock_cbt_bench_data_missing_thoughts):
        dataset = await loader.fetch_dataset()
```
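For context, the branching under discussion can be sketched standalone. This is a hypothetical reconstruction of the loader's value-combination rules, not the actual PyRIT code; in particular, the combined-format string is an assumption:

```python
def combine(situation: str, thoughts: str) -> str:
    """Mimics the loader's branching: both fields, thoughts-only,
    situation-only, or neither."""
    if situation and thoughts:
        # Combined format is an assumed placeholder, not PyRIT's exact text.
        return f"Situation: {situation}\nThoughts: {thoughts}"
    if thoughts:
        # The untested branch the review flags: situation empty, thoughts set.
        return thoughts
    return situation

# The regression case the review asks to cover:
assert combine("", "Everyone will laugh at me") == "Everyone will laugh at me"
```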
doc/code/datasets/1_loading_datasets.ipynb:30
- This notebook output includes an environment-specific warning and a local filesystem path (e.g., `/mnt/c/Users/.../.venv/...`). Clear cell outputs before committing to avoid leaking local paths and to keep docs deterministic.
```json
{
  "name": "stderr",
  "output_type": "stream",
  "text": [
    "/mnt/c/Users/warisgill/Documents/PyRIT/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
    " from .autonotebook import tqdm as notebook_tqdm\n"
  ]
},
```
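Clearing outputs can be done with a tool such as `jupyter nbconvert --clear-output --inplace <notebook>`; the effect on the notebook JSON amounts to the following sketch (a generic helper written for illustration, not part of PyRIT):

```python
def clear_outputs(nb: dict) -> dict:
    """Strip outputs and execution counts from a notebook's JSON so
    committed docs carry no machine-specific paths or warnings."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Toy notebook with a stderr stream output like the one flagged above.
nb = {"cells": [{"cell_type": "code", "execution_count": 3,
                 "source": ["print('hi')"],
                 "outputs": [{"name": "stderr", "output_type": "stream",
                              "text": ["local path warning here"]}]}]}
nb = clear_outputs(nb)
assert nb["cells"][0]["outputs"] == []
```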
doc/code/datasets/1_loading_datasets.ipynb:52
- PR description says no documentation changes are needed, but this PR changes a documentation notebook (adds stderr output and updates the dataset list output). Either update the PR description to mention the doc change, or revert the notebook changes (preferably by clearing outputs and keeping only intentional content updates).
```json
" 'airt_violence',\n",
" 'aya_redteaming',\n",
" 'babelscape_alert',\n",
" 'cbt_bench',\n",
" 'ccp_sensitive_prompts',\n",
```
Regenerate notebook outputs to include cbt_bench in the dataset list and sanitize user-specific paths from cell outputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Integrate the CBT-Bench psychotherapy benchmark dataset from HuggingFace
(Psychotherapy-LLM/CBT-Bench) into PyRIT.
Changes
- `pyrit/datasets/seed_datasets/remote/cbt_bench_dataset.py`: `_CBTBenchDataset` loader
- `pyrit/datasets/seed_datasets/remote/__init__.py`: registered the new loader
- `tests/unit/datasets/test_cbt_bench_dataset.py`: 7 unit tests

Key Design Decisions
- Prompt value combines `situation` + `thoughts` (per reviewer feedback on FEAT: CBT-Bench Dataset #888)
- Harm categories: `["psycho-social harms"]` (per @jbolor21 and @romanlutz discussion on FEAT: CBT-Bench Dataset #888)
- Defaults to `core_fine_seed` but supports all 39 HuggingFace subsets via the `config` parameter
- `core_belief_fine_grained` stored in metadata for downstream use

Closes #865 (supersedes stale PR #888)
@romanlutz
Tests and Documentation
`tests/unit/datasets/test_cbt_bench_dataset.py`: all tests mock `_fetch_from_huggingface` (no network calls). All 98 dataset unit tests pass (91 existing + 7 new). Ruff lint clean.

The loader class is private (`_`-prefixed) and auto-registers via `SeedDatasetProvider.__init_subclass__`; `api.rst` only lists `SeedDatasetProvider`, not individual dataset loaders.