Skip to content

fix: Optimize !~ '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq ""#20702

Open
petern48 wants to merge 6 commits intoapache:mainfrom
petern48:bug_regexp_optim
Open

fix: Optimize !~ '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq ""#20702
petern48 wants to merge 6 commits intoapache:mainfrom
petern48:bug_regexp_optim

Conversation

@petern48
Copy link
Contributor

@petern48 petern48 commented Mar 4, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

A pre-existing optimization rule for the !~ .* (regexp not match) case rewrote the plan to Eq "", which would return empty strings as part of the result. This is incorrect and doesn't match the output without the optimization rule.

Instead, this PR rewrites the plan to simply col IS NOT NULL AND Boolean(NULL) or, in other words, "NULL if col is NULL else false."

I've confirmed this behavior matches the result of running queries manually with the optimization rule turned off.

Are these changes tested?

Fixed expected output in tests. Added new tests for nulls

Are there any user-facing changes?

Yes, a minor bug fix. When querying s !~ .*, empty strings will no longer be included in the result which is consistent with the behavior without the optimization rule.

@petern48 petern48 changed the title Fix: Optimize ~! '.*' and '.*' cases to False instead of Eq "" fix/perf: Optimize ~! '.*' case to False instead of Eq "" Mar 4, 2026
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 4, 2026
@petern48 petern48 marked this pull request as ready for review March 4, 2026 18:28
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @petern48

/// - combinations (alternatives) of the above, will be concatenated with `OR` or `AND`
/// - `EQ .*` to NotNull
/// - `NE .*` means IS EMPTY
/// - `NE .*` to false (.* matches non-empty and empty strings, and NULL !~ '.*' results in NULL so this can never be true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is only false in the context of filters (and this code can currently be used for both filters and regular expressions)

I think null !~ '.*' is actually null (not false)

I think a correct rewrite is

CASE 
  WHEN col IS NOT NULL THEN FALSE 
  ELSE NULL
END

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thanks, i didn't realize this code applied to cases outside of filters. I've fixed it and updated the PR title and description.

note: the optimizer would further rewrite the CASE expr into IS NOT NULL AND NULL when I ran the tests, so I rewrote it directly to that instead

)?;

// Test `!= ".*"` transforms to checking if the column is empty
// Test `!~ ".*"` transforms to false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we please also add a test explicitly for a null input?

@petern48 petern48 changed the title fix/perf: Optimize ~! '.*' case to False instead of Eq "" fix: Optimize ~! '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" Mar 5, 2026
@petern48 petern48 changed the title fix: Optimize ~! '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" fix: Optimize !~ '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" Mar 5, 2026
@alamb alamb added the bug Something isn't working label Mar 5, 2026
left,
op: Operator::Eq,
right: empty_lit,
left: Box::new(left.is_not_null()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this case should be is_null rather than is_not_null

I think this because when left is null, The expression null !~ '.*' should also evaluate to null

However, as written (col IS NOT NULL AND NULL) will evaluate

col IS NOT NULL AND NULL
-->  FALSE AND NULL
-->  FALSE 

If the transformation is col IS NULL AND NULL 🤯 then:

col IS NULL AND NULL
-->  true AND NULL
-->  NULL 

As expected

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @petern48

----
logical_plan
01)Filter: t.b = Utf8View("")
01)Filter: t.b IS NOT NULL AND Boolean(NULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a test that actually executes col !~ .* for a column that has 'foo', '', and null

Specifically like this (the ::text is needed for postgres):

-- Reproducer for regex simplification of `col !~ '.*'`
-- Input values: 'foo', '', NULL

WITH t(col) AS (
    VALUES
      ('foo'::text),
      (''::text),
      (NULL::text)
  )
  SELECT
    col,
    col !~ '.*' AS not_match_dot_star
  FROM t;

Should result in

+------+-----------------+
| a    | baseline_result |
+------+-----------------+
| foo  | false           |
|      | false           |
| NULL | NULL            |
+------+-----------------+

Here is what current datafusion does

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ datafusion-cli
DataFusion CLI v52.1.0
> WITH t(col) AS (
    VALUES
      ('foo'::text),
      (''::text),
      (NULL::text)
  )
  SELECT
    col,
    col !~ '.*' AS not_match_dot_star
  FROM t;
+------+--------------------+
| col  | not_match_dot_star |
+------+--------------------+
| foo  | false              |
|      | true               |
| NULL | NULL               |
+------+--------------------+
3 row(s) fetched.
Elapsed 0.038 seconds.

Here is what postgres says

postgres=#   WITH t(col) AS (
    VALUES
      ('foo'::text),
      (''::text),
      (NULL::text)
  )
  SELECT
    col,
    col !~ '.*' AS not_match_dot_star
  FROM t;
 col | not_match_dot_star
-----+--------------------
 foo | f
     | f
     |
(3 rows)

I think this branch will not get the right value for col IS NULL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: regexp simplify optimization incorrect simplifies .* pattern to Eq "" operation

2 participants