Skip to content

TLP test fail on sqlancer due to num_nulls interpreted differently on workers #8468

@alperkocatas

Description

@alperkocatas

sqlancer found the following error, where a query rewritten based on TLP (Ternary Logic Partitioning) returns different cardinalities. It looks like the problem is caused by the condition (num_nulls(t1.c0) IS DISTINCT FROM t1.c0) being interpreted differently when they are pushed down to workers.

We have the following distributed tables:

...
SELECT create_distributed_table('t1', 'c0');

SELECT create_distributed_table('t2', 'c1');

SELECT create_distributed_table('t4', 'c0');

SELECT create_distributed_table('t5', 'c0');
...

I tried to simplify the original sqlancer query as a smaller one which can still reproduce the error:

Query 1:

    SELECT t5.c0

    FROM t1*, t0

    LEFT OUTER JOIN t5* ON (True)

-- cardinality: 52

Query 2

    SELECT ALL t5.c0

    FROM t1, t0

    LEFT OUTER JOIN t5* ON (True)

    WHERE (num_nulls(t1.c0) IS DISTINCT FROM t1.c0)

  

    UNION ALL

  

    -- PART 2: WHERE condition IS FALSE (NOT condition)

    SELECT ALL t5.c0

    FROM t1*, t0*

    LEFT OUTER JOIN t5 ON (True)

    WHERE NOT (num_nulls(t1.c0) IS DISTINCT FROM t1.c0)


    UNION ALL

  
    -- PART 3: WHERE condition IS NULL

    SELECT ALL t5.c0

    FROM t1*, t0

    LEFT OUTER JOIN t5* ON (True)

    WHERE (num_nulls(t1.c0) IS DISTINCT FROM t1.c0) IS NULL;

-- cardinality: 44

-- ========================================

-- TLP EXPLANATION:

-- The TLP technique transforms a query Q into Q₁ ∪ Q₂ ∪ Q₃ where:

-- Q₁: SELECT ... WHERE condition IS TRUE  (condition evaluates to TRUE)

-- Q₂: SELECT ... WHERE condition IS FALSE (condition evaluates to FALSE)

-- Q₃: SELECT ... WHERE condition IS NULL  (condition evaluates to NULL)

--

-- In SQL's three-valued logic, every condition can evaluate to TRUE, FALSE, or NULL.

-- The union of these three partitions should be equivalent to the original query.

-- This is a fundamental property used by SQLancer to detect query processing bugs.

--

-- BUG DETECTED: Original query returns 52 rows, TLP query returns 44 rows!

-- This might indicate a bug in Citus's distributed query processing.

-- ========================================

Please see the attached file database5.log for a full log of sqlancer, which can be used to reproduce the case.

Note: If we comment out the create_distribute_table commands, both queries return 52 rows. So, this works in vanilla postgres.

Postgres version 17.7, citus version:

test=# \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description
---------+---------+------------+------------------------------
 citus   | 14.0-1  | pg_catalog | Citus distributed database
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

update: the error also exists in citus 13.2-1 and 13.1.1

database5.log

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions