GH-43695: [C++][Parquet] flatbuffers metadata integration by Jiayi-Wang-db · Pull Request #48431 · apache/arrow

Jiayi-Wang-db · 2025-12-10T16:29:43Z

Rationale for this change

Integrate FlatBuffers metadata into the Thrift footer.
The detailed design and experiment doc:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=sharing

What changes are included in this PR?

Definition of the FlatBuffer footer and the generated FlatBuffer file
To/FromFlatBuffer functions to convert between FlatBuffer and Thrift footer
Append/Extract FlatBuffer to/from the extension field of the Thrift footer
Use append/extract operations based on reader/writer flags

Are these changes tested?

Yes, with newly added tests.

Are there any user-facing changes?

Yes, users can write and read the FlatBuffer footer to speed up footer parsing.

GitHub Issue: [C++][Parquet] Proof-of-concept: Trying to using FlatBuffer as Parquet Footer #43695

Jiayi-Wang-db · 2025-12-10T16:50:37Z

cpp/src/parquet/metadata3.cc

+  auto To(const format::ColumnMetaData& cm) {
+    if (!cm.encoding_stats.empty()) {
+      for (auto&& e : cm.encoding_stats) {
+        if (e.page_type != format::PageType::DATA_PAGE &&
+            e.page_type != format::PageType::DATA_PAGE_V2)
+          continue;
+        if (e.encoding != format::Encoding::PLAIN_DICTIONARY &&
+            e.encoding != format::Encoding::RLE_DICTIONARY) {
+          return false;
+        }
+      }
+      return true;
+    }
+    bool has_plain_dictionary_encoding = false;
+    bool has_non_dictionary_encoding = false;
+    for (auto encoding : cm.encodings) {
+      switch (encoding) {
+        case format::Encoding::PLAIN_DICTIONARY:
+          // PLAIN_DICTIONARY encoding was present, which means at
+          // least one page was dictionary encoded and v1.0 encodings are used.
+          has_plain_dictionary_encoding = true;
+          break;
+        case format::Encoding::RLE:
+        case format::Encoding::BIT_PACKED:
+          // Other than for boolean values, RLE and BIT_PACKED are only used for
+          // repetition or definition levels. Additionally booleans are not dictionary
+          // encoded hence it is safe to disregard the case where some boolean data pages
+          // are dictionary encoded and some boolean pages are RLE/BIT_PACKED encoded.
+          break;
+        default:
+          has_non_dictionary_encoding = true;
+          break;
+      }
+    }
+    if (has_plain_dictionary_encoding) {
+      // Return true, if there are no encodings other than dictionary or
+      // repetition/definition levels.
+      return !has_non_dictionary_encoding;
+    }
+
+    // If PLAIN_DICTIONARY wasn't present, then either the column is not
+    // dictionary-encoded, or the 2.0 encoding, RLE_DICTIONARY, was used.
+    // For 2.0, this cannot determine whether a page fell back to non-dictionary encoding
+    // without page encoding stats.
+    return false;
+  }


This is not the same logic as parquet::IsColumnChunkFullyDictionaryEncoded, but it is the same as parquet-mr DictionaryFilter::HasNonDictionaryPages.
Need advice on what the difference is and which approach to follow.

Could you summarize the difference?
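
To make the comparison easier to reason about, the two-stage check in the hunk above can be exercised in isolation. This is a minimal sketch, assuming simplified stand-in types: `Encoding`, `PageType`, `PageEncodingStats`, `ColumnMeta` and the function name `IsFullyDictEncoded` are hypothetical replacements for the generated Thrift structs and are not taken from the PR.

```cpp
#include <vector>

// Hypothetical stand-ins for the generated Thrift types (format::Encoding etc.).
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED, RLE_DICTIONARY };
enum class PageType { DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 };

struct PageEncodingStats {
  PageType page_type;
  Encoding encoding;
};

struct ColumnMeta {
  std::vector<Encoding> encodings;
  std::vector<PageEncodingStats> encoding_stats;
};

// Mirrors the control flow of the quoted hunk: prefer per-page encoding stats,
// otherwise fall back to the coarser list of encodings.
bool IsFullyDictEncoded(const ColumnMeta& cm) {
  if (!cm.encoding_stats.empty()) {
    for (const auto& e : cm.encoding_stats) {
      // Only data pages matter; dictionary and index pages are skipped.
      if (e.page_type != PageType::DATA_PAGE &&
          e.page_type != PageType::DATA_PAGE_V2) {
        continue;
      }
      if (e.encoding != Encoding::PLAIN_DICTIONARY &&
          e.encoding != Encoding::RLE_DICTIONARY) {
        return false;
      }
    }
    return true;
  }
  bool has_plain_dictionary = false;
  bool has_non_dictionary = false;
  for (Encoding enc : cm.encodings) {
    switch (enc) {
      case Encoding::PLAIN_DICTIONARY:
        has_plain_dictionary = true;
        break;
      case Encoding::RLE:
      case Encoding::BIT_PACKED:
        break;  // level encodings, ignored
      default:
        has_non_dictionary = true;  // includes RLE_DICTIONARY, as in the hunk
        break;
    }
  }
  // Without stats, only the v1.0 PLAIN_DICTIONARY case can be answered "yes".
  return has_plain_dictionary && !has_non_dictionary;
}
```

One consequence worth noting: without `encoding_stats`, a column that lists only `RLE_DICTIONARY` and `RLE` is reported as not fully dictionary encoded, because a fallback page cannot be ruled out.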

rok · 2025-12-11T18:49:34Z

Great to see things moving here!

Rationale for this change: Add link to flatbuf footer ticket with proposal.
What changes are included in this PR?
Do these changes have PoC implementations? apache/arrow#48431

rok · 2025-12-16T18:59:54Z

cpp/src/parquet/metadata3.h

+// Returns the size of the flatbuffer if found (and writes to out_flatbuffer),
+// returns 0 if no flatbuffer extension is present, or returns the required
+// buffer size if the input buffer is too small.
+::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);


Since FileMetaData::Make takes uint32_t as metadata_len it might make sense to return it here?

Suggested change

- ::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
+ ::arrow::Result<uint32_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
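
If the return type is narrowed to uint32_t, the size computed internally presumably still starts out wider, so the conversion needs a range check somewhere. A minimal sketch of such a check follows; `NarrowToU32` is a hypothetical helper and is not part of the PR or of Arrow.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: converts a wide length to the uint32_t that a
// metadata_len-style parameter expects, or returns nullopt when it does not fit.
inline std::optional<uint32_t> NarrowToU32(uint64_t size) {
  if (size > std::numeric_limits<uint32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<uint32_t>(size);
}
```

In the real function, the out-of-range case would more likely surface as an error status than as an empty optional.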

rok · 2026-01-20T15:03:03Z

cpp/src/parquet/metadata3.h

+#include "arrow/result.h"
+#include "flatbuffers/flatbuffers.h"
+#include "generated/parquet3_generated.h"
+#include "generated/parquet_types.h"


Is metadata3.h meant to be public? If so, this will make the generated Thrift header public as well. Perhaps we could introduce MakeFromFlatbuffer in metadata.h/cc instead, so we can use it in file_reader.cc:457.

static std::shared_ptr<FileMetaData> MakeFromFlatbuffer(
    const uint8_t* flatbuffer_data, size_t flatbuffer_size, uint32_t metadata_len,
    const ReaderProperties& properties = default_reader_properties());

Some effort is made to keep Thrift structs out of the public API; I think we should take the same approach with FlatBuffers.
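
One way to read this suggestion is as the usual opaque-implementation pattern: the public class accepts raw bytes and hides whatever generated type it decodes into. The sketch below illustrates that shape only. `FileMetaDataSketch`, its `Impl` and `serialized_size()` are hypothetical names, and the `std::string` member merely stands in for a decoded generated struct.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

// ---- What a public header would expose: no generated types appear here. ----
class FileMetaDataSketch {
 public:
  // Takes raw bytes instead of a generated FlatBuffer or Thrift struct.
  static std::shared_ptr<FileMetaDataSketch> MakeFromFlatbuffer(
      const uint8_t* flatbuffer_data, size_t flatbuffer_size);
  ~FileMetaDataSketch();
  size_t serialized_size() const;

 private:
  FileMetaDataSketch();
  struct Impl;  // defined only in the implementation file
  std::unique_ptr<Impl> impl_;
};

// ---- What would live in the .cc file, next to the generated includes. ----
struct FileMetaDataSketch::Impl {
  std::string bytes;  // stand-in for the decoded, generated representation
};

FileMetaDataSketch::FileMetaDataSketch() : impl_(new Impl) {}
FileMetaDataSketch::~FileMetaDataSketch() = default;

std::shared_ptr<FileMetaDataSketch> FileMetaDataSketch::MakeFromFlatbuffer(
    const uint8_t* flatbuffer_data, size_t flatbuffer_size) {
  std::shared_ptr<FileMetaDataSketch> md(new FileMetaDataSketch());
  md->impl_->bytes.assign(reinterpret_cast<const char*>(flatbuffer_data),
                          flatbuffer_size);
  return md;
}

size_t FileMetaDataSketch::serialized_size() const { return impl_->bytes.size(); }
```

With this layout, callers such as a file reader only need the public header, and the generated headers stay an implementation detail.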

alkis · 2026-01-21T18:31:59Z

FYI @emkornfield @prtkgaur if you want to take a look

emkornfield · 2026-01-21T23:31:48Z

cpp/src/parquet/file_reader.cc

+              format3::GetFileMetaData(flatbuffer_data.data());
+          auto thrift_metadata =
+              std::make_unique<format::FileMetaData>(FromFlatbuffer(fb_metadata));
+          file_metadata_ = FileMetaData::Make(


FileMetadata is already a wrapper around Thrift. Is there a reason we don't have a different implementation that is made purely from the FileMetadata?

emkornfield · 2026-01-21T23:38:39Z

cpp/src/parquet/parquet3.fbs

@@ -0,0 +1,224 @@
+namespace parquet.format3;


I left comments on the PR for the FBS file in parquet-format; we should resync after those are addressed.

emkornfield · 2026-01-21T23:39:36Z

cpp/src/parquet/properties.h

  void set_footer_read_size(size_t size) { footer_read_size_ = size; }
  size_t footer_read_size() const { return footer_read_size_; }

+  // If enabled, try to read the metadata3 footer from the file.


Suggested change

- // If enabled, try to read the metadata3 footer from the file.
+ // If enabled, try to read the flatbuffer metadata footer from the file as an extension (i.e. a PAR1 file).

emkornfield · 2026-01-21T23:41:16Z

cpp/src/parquet/properties.h

+  // If it fails, fall back to Thrift footer decoding.
+  bool read_metadata3() const { return read_metadata3_; }
+  void set_read_metadata3(bool read_metadata3) { read_metadata3_ = read_metadata3; }
+


I guess we need to finalize a PAR2 or PAR3 footer to be able to write this out without the extension. I think that can be follow-up work, but it would be nice to do it as part of the FBS work to ensure we can eventually move away from Thrift.

emkornfield · 2026-01-21T23:41:56Z

cpp/src/parquet/properties.h


+  // If enabled, try to read the metadata3 footer from the file.
+  // If it fails, fall back to Thrift footer decoding.
+  bool read_metadata3() const { return read_metadata3_; }


Suggested change

- bool read_metadata3() const { return read_metadata3_; }
+ bool read_flatbuffer_metadata_if_present() const { return read_metadata3_; }

emkornfield · 2026-01-21T23:42:36Z

cpp/src/parquet/properties.h

  bool page_checksum_verification_ = false;
  // Used with a RecordReader.
  bool read_dense_for_nullable_ = false;
+  bool read_metadata3_ = false;


I think this should default to true? Otherwise I worry that readers won't get the benefit.
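
For context on what the flag controls, the property comment describes a "try the FlatBuffer footer, fall back to Thrift" read path. A minimal sketch of that control flow is below. All names (`Footer`, `Decoder`, `ReadFooter`) are hypothetical, and the real reader works on FileMetaData rather than on these stand-ins.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for decoded footer metadata.
struct Footer {
  std::string source;  // which decoder produced this footer
};

// A decoder returns nullopt when its footer is absent or fails to parse.
using Decoder = std::function<std::optional<Footer>()>;

// Tries the FlatBuffer decoder only when the flag is set. A disabled flag or a
// failed FlatBuffer decode both fall through to the Thrift decoder.
inline std::optional<Footer> ReadFooter(bool read_flatbuffer_if_present,
                                        const Decoder& decode_flatbuffer,
                                        const Decoder& decode_thrift) {
  if (read_flatbuffer_if_present) {
    if (auto footer = decode_flatbuffer()) {
      return footer;
    }
  }
  return decode_thrift();
}
```

Under this shape, defaulting the flag to true changes nothing for files without the extension, since they always reach the Thrift path.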

emkornfield · 2026-01-31T18:08:28Z

cpp/src/parquet/metadata3.cc

+  LZ4_RAW = 7,
+};
+
+auto GetNumChildren(


Style nit: is auto needed here? Generally we wouldn't use it unless it was needed for templating, etc.

emkornfield · 2026-01-31T18:09:10Z

cpp/src/parquet/metadata3.cc

+
+auto GetName(const std::vector<format::SchemaElement>& s, size_t i) { return s[i].name; }
+
+class ColumnMap {


please add docs.

emkornfield · 2026-01-31T18:09:32Z

cpp/src/parquet/metadata3.cc

+    BuildParents(s);
+  }
+
+  size_t ToSchema(size_t cc_idx) const { return colchunk2schema_[cc_idx]; }


emkornfield · 2026-01-31T18:11:38Z

cpp/src/parquet/metadata3.cc

+  std::vector<uint32_t> parents_;
+};
+
+struct MinMax {

docs.


emkornfield · 2026-01-31T18:21:08Z

cpp/src/parquet/metadata3.cc

+  uint8_t* const p = reinterpret_cast<uint8_t*>(out.data()) + n + 1;
+
+  // Compute and store checksums and lengths
+  uint32_t crc32 = ::arrow::internal::crc32(0, reinterpret_cast<const uint8_t*>(out.data()), n + 1);


Is this format documented? (I might have missed it in the parquet-format pull request.)

emkornfield · 2026-01-31T18:23:07Z

cpp/src/parquet/metadata3.cc

+  } while (true);
+}
+
+inline uint32_t CountLeadingZeros32(uint32_t v) {


existing util

emkornfield · 2026-01-31T18:23:45Z

cpp/src/parquet/metadata3.cc

+  return out;
+}
+
+inline uint8_t* WriteULEB64(uint64_t v, uint8_t* out) {


We should have something like this for delta binary packed, which uses ULEB as well. Could you look there?
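
For readers unfamiliar with the encoding under discussion, here is a minimal unsigned LEB128 sketch. It reuses the `WriteULEB64` signature from the hunk, but the body is an illustration and not the PR's implementation; `ReadULEB64` is a hypothetical counterpart added only to show the round trip.

```cpp
#include <cstdint>

// Writes v as unsigned LEB128: 7 payload bits per byte, with the high bit set
// on every byte except the last. Returns a pointer one past the last byte
// written. The caller must provide room for up to 10 bytes.
inline uint8_t* WriteULEB64(uint64_t v, uint8_t* out) {
  while (v >= 0x80) {
    *out++ = static_cast<uint8_t>((v & 0x7F) | 0x80);
    v >>= 7;
  }
  *out++ = static_cast<uint8_t>(v);
  return out;
}

// Hypothetical counterpart: decodes one value from [in, end). Returns a
// pointer one past the last byte consumed, or nullptr when the input is
// truncated or longer than 10 bytes.
inline const uint8_t* ReadULEB64(const uint8_t* in, const uint8_t* end,
                                 uint64_t* v) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && in != end; shift += 7) {
    const uint8_t byte = *in++;
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *v = result;
      return in;
    }
  }
  return nullptr;
}
```

For example, 300 encodes to the two bytes 0xAC 0x02, and the largest 64-bit value takes 10 bytes.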

emkornfield · 2026-01-31T18:28:57Z

cpp/src/parquet/metadata3.h

+// The extension itself is as follows:
+//
+// +-------------------+------------+--------------------------------------+----------------+---------+--------------------------------+------+
+// | compress(flatbuf) | compressor | crc(compress(flatbuf) .. compressor) | compressed_len | raw_len | crc(compressed_len .. raw_len) | UUID |


This should be documented in the parquet-format PR.

Jiayi-Wang-db requested a review from wgtmac as a code owner December 10, 2025 16:29

github-actions bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Dec 10, 2025

alkis mentioned this pull request Dec 10, 2025

GH-43695: [C++][Parquet] flatbuffers metadata experiments #43793

Closed

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 11, 2025

alkis mentioned this pull request Dec 12, 2025

GH-531: Link footer proposal alkis/parquet-format#1

Closed

This was referenced Dec 12, 2025

GH-531: Link footer proposal apache/parquet-format#543

Merged

GH-531: Add parquet flatbuf schema apache/parquet-format#544

Open

This was referenced Dec 24, 2025

Parquet metadata as flatbuffers apache/arrow-rs#9041

Open

parquet: use flatbuffers to store metadata (WIP) apache/arrow-rs#9042

Draft

rok force-pushed the flatbuf3 branch 3 times, most recently from eccaba1 to 7d5b33b Compare March 5, 2026 22:38

alkis and others added 17 commits March 6, 2026 00:06

1/n flatbuf footer

cb0d8f9

2/n remove deprecated ColumnChunk.file_offset

64710e3

3/n optimize statistics

2e2225f

4/n optimize offsets, num_values, and byte counts

0adac0f

5/n if column chunk is dense do not encode num_values

54a9907

6/n remove encoding_stats

28d7de8

7/n remove path_in_schema

b83836d

8/n remove encodings

0317367

9/n optimize statistics take 2

b45c8e2

Update to latest flatbuf definition. Store 64-bit offsets and lengths instead of relative ones. Implement Geo types. Add Float16 and Variant. Pack statistics better

424ccb9

fix benchmark build

af999b4

split code

dbdb083

Add reader and writer flag and embed metadata3 into thrift

acc9ffb

test

fd23273

reduce duplicate code

2973044

Fix is_fully_dict_encoded

906ee17

fix build

7704cf3

rok force-pushed the flatbuf3 branch 4 times, most recently from 8c94e80 to 72ac531 Compare March 6, 2026 00:09

Some review feedback

501180a

rok force-pushed the flatbuf3 branch from 72ac531 to 501180a Compare March 6, 2026 15:55
