Skip to content

Add doctests to functions to help humans and agents#1400

Draft
ntjohnson1 wants to merge 5 commits intoapache:mainfrom
rerun-io:nick/doctests
Draft

Add doctests to functions to help humans and agents#1400
ntjohnson1 wants to merge 5 commits intoapache:mainfrom
rerun-io:nick/doctests

Conversation

@ntjohnson1
Copy link
Contributor

@ntjohnson1 ntjohnson1 commented Feb 26, 2026

Which issue does this PR close?

Related to #1394
But I can file another issue.
I generated some examples manually then asked Claude to extend the pattern.
I've looked over the examples and they all seem reasonably minimal and get tested for correctness by doctest.

Rationale for this change

Generally when I am using the dataframe API the docstring/definition alone isn't always sufficient to highlight how or when to use all of the different options. I often find myself generating a small standalone example to verify..

Related to agents they do a great job interpolating so have more small standalone examples for various functionality should help them but I don't have a reference for that.

What changes are included in this PR?

  • Turns on doctest so pytest verifies docstring examples
  • Minor update to the test.yml to fix test discovery with doctest turned on, and fixed a github lint that triggered for me since I touched that file
  • Fixes a few existing examples to be compliant
  • Generates examples for everything in functions

Are there any user-facing changes?

No

>>> builtins.round(
... result.collect_column("rad")[0].as_py(), 6
... )
3.141593
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't like the builtins round we can just use the doctest ellipses approach instead.

Comment on lines +1473 to +1474
>>> import datafusion as dfn
>>> ctx = dfn.SessionContext()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doctests allows specifying a prefix to always run if the import and ctx feel like too much bloat.

@ntjohnson1 ntjohnson1 marked this pull request as ready for review February 26, 2026 16:38
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for working on this @ntjohnson1
This PR is massive and so I think it will take time to review and get through

Would you willing to break it up into a set of smaller PRs? For example:

  1. A PR with an initial example and the infrastructure needed to test the examples
  2. A bunch of PRs, each having a few (5?) functions that can be individually reviewed and verified?

with:
path: ~/.cargo
key: cargo-cache-${{ steps.rust-toolchain.outputs.cachekey }}-${{ hashFiles('Cargo.lock') }}
key: cargo-cache-${{ matrix.toolchain }}-${{ hashFiles('Cargo.lock') }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems unrelated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this was already broken but my change here invalidated the cache and exercised the fact that this line is no longer valid. This was required to make CI, happy.

[tool.pytest.ini_options]
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
addopts = "--doctest-modules"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this here? (sorry I don't know python enough)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default the examples in the doc strings aren't executed. There are a few different ways to turn that functionality on and this seemed the least intrusive. I broke down my commits somewhat (besides the giant here are loads of examples). So the first commit allows testing of the examples in the doc strings. Then I ran pytest (which now automatically runs the tests in the doc strings) and fixed the few cases we already had.

@ntjohnson1
Copy link
Contributor Author

Would you willing to break it up into a set of smaller PRs? For example:

  1. A PR with an initial example and the infrastructure needed to test the examples
  2. A bunch of PRs, each having a few (5?) functions that can be individually reviewed and verified?

Yep, happy to break it up. 1 is basically commits 1,2 and 4 from here. Then commits 3+5 can be split up. The first one I can probably do today and the follow up splitting hopefully this weekend.

@ntjohnson1
Copy link
Contributor Author

Moving this to a draft as unified example of the end state (at least for functions) as I split up the PR.

@ntjohnson1 ntjohnson1 marked this pull request as draft March 4, 2026 20:20
@alamb
Copy link
Contributor

alamb commented Mar 4, 2026

Thank you @ntjohnson1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants