
feat: Vectorize aggregating Statistics #20768

Open
jonathanc-n wants to merge 3 commits into apache:main from jonathanc-n:speed-up-stats-aggregation

@jonathanc-n
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Vectorize the aggregations that combine statistics: gather all values first, then call each kernel once.
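As a rough illustration of the gather-then-reduce idea, here is a minimal sketch using only the standard library. The `Scalar` enum and the functions `sum_pairwise` and `sum_gathered` are hypothetical stand-ins invented for this example; they are not DataFusion's `ScalarValue` or the Arrow kernels the PR calls, and the null handling (any null yields `None`) is an assumption made to keep the sketch short.

```rust
// Hypothetical stand-in for a dynamically typed scalar.
// This is NOT DataFusion's `ScalarValue`; it only mimics its shape.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
    Float64(Option<f64>),
}

/// Adds two scalars, matching on the variant for every single addition.
fn add_pair(a: &Scalar, b: &Scalar) -> Option<Scalar> {
    match (a, b) {
        (Scalar::Int64(Some(x)), Scalar::Int64(Some(y))) => {
            x.checked_add(*y).map(|s| Scalar::Int64(Some(s)))
        }
        (Scalar::Float64(Some(x)), Scalar::Float64(Some(y))) => {
            Some(Scalar::Float64(Some(x + y)))
        }
        // Assumption for this sketch: nulls and mixed types give no result.
        _ => None,
    }
}

/// Pairwise approach: one type dispatch and one new `Scalar` per element.
fn sum_pairwise(values: &[Scalar]) -> Option<Scalar> {
    let (first, rest) = values.split_first()?;
    rest.iter().try_fold(first.clone(), |acc, v| add_pair(&acc, v))
}

/// Gather-then-reduce: pick the type once from the first element, copy the
/// primitives into a contiguous buffer, then reduce the buffer in one pass.
fn sum_gathered(values: &[Scalar]) -> Option<Scalar> {
    match values.first()? {
        Scalar::Int64(_) => {
            let buf = values
                .iter()
                .map(|v| match v {
                    Scalar::Int64(Some(x)) => Some(*x),
                    _ => None,
                })
                .collect::<Option<Vec<i64>>>()?;
            buf.iter()
                .try_fold(0i64, |acc, x| acc.checked_add(*x))
                .map(|s| Scalar::Int64(Some(s)))
        }
        Scalar::Float64(_) => {
            let buf = values
                .iter()
                .map(|v| match v {
                    Scalar::Float64(Some(x)) => Some(*x),
                    _ => None,
                })
                .collect::<Option<Vec<f64>>>()?;
            Some(Scalar::Float64(Some(buf.iter().sum::<f64>())))
        }
    }
}
```

Both functions return the same result for well-typed, non-null input; the difference is only in how often the type dispatch happens. Any speedup in a real implementation would have to be confirmed by a benchmark.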

Are these changes tested?

Unit tests + existing tests

Are there any user-facing changes?

Removed merge_iter

@github-actions (bot) added the "common" label (Related to common crate) on Mar 7, 2026
/// assert_eq!(merged.column_statistics[0].sum_value,
/// Precision::Exact(ScalarValue::from(1500)));
/// ```
pub fn try_merge(self, other: &Statistics) -> Result<Self> {
Contributor Author


I think this removal should be fine, since most API calls would have gone through try_merge_iter. It should be mentioned in the upgrade guide, though.

Contributor Author


I had to add this instead of importing the common aggregate functions, because importing them would create a circular dependency. Much of it is just duplicated code.

@jonathanc-n
Contributor Author

I verified with a small benchmark that this gives a 5x speedup for numeric primitive values. It felt unnecessary to add the benchmark, since this is just a regular vectorization optimization.

}
}

/// Compute the sum of a collection of [`ScalarValue`]s using vectorized
Contributor


Is this really faster than summing the primitive values directly out of the ScalarValues (without creating a ScalarValue)?
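For reference, the alternative raised in this question could look roughly like the sketch below: read each primitive out of its scalar and accumulate in a plain integer, with no intermediate buffer and no scalar allocated per step. The `Scalar` enum and `sum_int64_direct` are hypothetical names made up for this illustration, not DataFusion types or functions, and treating nulls and non-Int64 values as "no result" is an assumption.

```rust
// Hypothetical stand-in for a typed scalar; NOT DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
    UInt64(Option<u64>),
}

/// Sums Int64 scalars by extracting each primitive and accumulating in an
/// `i64`. Returns `None` for empty input, a null, a non-Int64 value, or
/// overflow (assumptions chosen for this sketch).
fn sum_int64_direct(values: &[Scalar]) -> Option<i64> {
    if values.is_empty() {
        return None;
    }
    values.iter().try_fold(0i64, |acc, v| match v {
        Scalar::Int64(Some(x)) => acc.checked_add(*x),
        _ => None,
    })
}
```

Which variant is faster in practice depends on the element type and input size, so it would need to be measured; the thread does not include benchmark numbers for this alternative.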


Development

Successfully merging this pull request may close these issues.

Merging Statistics is slow when sum statistic is present

2 participants