Skip to content

Improvements to partition export#1402

Open
arthurpassos wants to merge 24 commits intoantalya-26.1from
26_1_export_improvements_test
Open

Improvements to partition export#1402
arthurpassos wants to merge 24 commits intoantalya-26.1from
26_1_export_improvements_test

Conversation

@arthurpassos
Copy link
Collaborator

@arthurpassos arthurpassos commented Feb 13, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Forward port of unmerged PR #1177

Documentation entry for user-facing changes

As of now, this PR does the following things:

  1. Limit the amount of export part operations that can be scheduled based on the BackgroundMovesExecutor. This applies both to partition and part exports. It is no longer memory bound, it solves the problem of one replica locking all parts in a task just because it ran faster even if it did not have bandwidth to execute all of them.
  2. Introduce ZooKeeper export partition requests specific metrics (I suppose we'll remove this later, for now it is good to have to be able to benchmark different approaches).
  3. Introduce lock a data part inside the task strategy as opposed to locking and only then scheduling a task. This is controlled by export_merge_tree_partition_lock_inside_the_task. I don't think we want users to do it, it is to experiment and benchmark. See ExportPartFromPartitionExportTask
  4. Only run the scheduler if we have available slots. And only schedule as many as we can (based on slots). This is subject to TOCTOU. The background executor has an internal pending queue, so even if we have available slots a task might end up on pending. To tackle this, we would need to write a new background executor, but not a must for now.
  5. Add local system.replicated_partition_exports option. Refactor querying system.replicated_partition_exports to use multi_read requests instead of several different read requests. Throws iff multi_read is not supported (this was vibe coded).
  6. Shuffle parts to export in a given partition task before choosing a part to work on to avoid locking collisions.
  7. Save the entire Settings object instead of only FormatSettings - now more settings should be preserved (part export only)
  8. Clear part references in partition export manifest once it is no longer pending

lock_inside_the_task is not production ready as of now (the entire feature is not production ready tbh) - there is a possible crash in case the user schedules an export, changes the schema of the destination table, and the export executes. This is because validation is not being done on the fly for this setting. But I think it is ok to ignore this corner case for now.

This partially tackles the following:

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@arthurpassos arthurpassos changed the base branch from fp_antalya_26_1_export_part_partition to antalya-26.1 March 4, 2026 16:00
@arthurpassos arthurpassos changed the title 26 1 export improvements test Improvements to partition export Mar 4, 2026
@arthurpassos arthurpassos marked this pull request as ready for review March 4, 2026 16:07
@arthurpassos
Copy link
Collaborator Author

@codex please review it deeply

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

Workflow [PR], commit [1e85917]

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fad4e2402

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

/// what to do if it's not a valid value?
}

manifest.lock_inside_the_task = json->getValue<bool>("lock_inside_the_task");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep manifest parsing backward-compatible

Make lock_inside_the_task optional when deserializing metadata. Existing metadata.json znodes created before this change do not contain that key, so json->getValue<bool>("lock_inside_the_task") throws during polling/system-table reads and can prevent processing of in-flight exports after upgrade until those znodes are manually cleaned up.

Useful? React with 👍 / 👎.

Comment on lines +40 to +41
LOG_INFO(storage.log, "ExportPartFromPartitionExportTask: Failed to lock part {}, skipping", part_name);
return false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove export manifest when lock acquisition fails

Clean up the pre-inserted export_manifests entry on the lock-failure path. In lock_inside_the_task mode, the scheduler inserts the manifest before execution, but if tryCreate(.../locks/<part>) fails here, the task exits without erasing that manifest, leaving the part permanently marked as already-exporting on this replica and blocking retries if the current lock owner dies before finishing.

Useful? React with 👍 / 👎.

Comment on lines +420 to +421
info.transaction_id = metadata.transaction_id;
info.create_time = metadata.create_time;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Populate query_id in partition export system rows

Assign info.query_id when building ReplicatedPartitionExportInfo. This method now fills many fields from manifest metadata but never sets query_id, so system.replicated_partition_exports.query_id is always empty even for active exports, which is a regression from the previous implementation and breaks query-to-export correlation.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants