GH-43695: [C++][Parquet] flatbuffers metadata integration#48431
GH-43695: [C++][Parquet] flatbuffers metadata integration#48431Jiayi-Wang-db wants to merge 18 commits intoapache:mainfrom
Conversation
| auto To(const format::ColumnMetaData& cm) { | ||
| if (!cm.encoding_stats.empty()) { | ||
| for (auto&& e : cm.encoding_stats) { | ||
| if (e.page_type != format::PageType::DATA_PAGE && | ||
| e.page_type != format::PageType::DATA_PAGE_V2) | ||
| continue; | ||
| if (e.encoding != format::Encoding::PLAIN_DICTIONARY && | ||
| e.encoding != format::Encoding::RLE_DICTIONARY) { | ||
| return false; | ||
| } | ||
| } | ||
| return true; | ||
| } | ||
| bool has_plain_dictionary_encoding = false; | ||
| bool has_non_dictionary_encoding = false; | ||
| for (auto encoding : cm.encodings) { | ||
| switch (encoding) { | ||
| case format::Encoding::PLAIN_DICTIONARY: | ||
| // PLAIN_DICTIONARY encoding was present, which means at | ||
| // least one page was dictionary encoded and v1.0 encodings are used. | ||
| has_plain_dictionary_encoding = true; | ||
| break; | ||
| case format::Encoding::RLE: | ||
| case format::Encoding::BIT_PACKED: | ||
| // Other than for boolean values, RLE and BIT_PACKED are only used for | ||
| // repetition or definition levels. Additionally booleans are not dictionary | ||
| // encoded hence it is safe to disregard the case where some boolean data pages | ||
| // are dictionary encoded and some boolean pages are RLE/BIT_PACKED encoded. | ||
| break; | ||
| default: | ||
| has_non_dictionary_encoding = true; | ||
| break; | ||
| } | ||
| } | ||
| if (has_plain_dictionary_encoding) { | ||
| // Return true, if there are no encodings other than dictionary or | ||
| // repetition/definition levels. | ||
| return !has_non_dictionary_encoding; | ||
| } | ||
|
|
||
| // If PLAIN_DICTIONARY wasn't present, then either the column is not | ||
| // dictionary-encoded, or the 2.0 encoding, RLE_DICTIONARY, was used. | ||
| // For 2.0, this cannot determine whether a page fell back to non-dictionary encoding | ||
| // without page encoding stats. | ||
| return false; | ||
| } |
There was a problem hiding this comment.
This is not the same logic as parquet::IsColumnChunkFullyDictionaryEncoded, but it is the same as parquet-mr DistionaryFilte::HasNonDictionaryPages.
Need advice on what's the difference and which approach to follow.
There was a problem hiding this comment.
Could you summarize the difference?
|
Great to see things moving here! |
### Rationale for this change Add link to flatbuf footer ticket with proposal. ### What changes are included in this PR? ### Do these changes have PoC implementations? apache/arrow#48431
### Rationale for this change Add link to flatbuf footer ticket with proposal. ### What changes are included in this PR? ### Do these changes have PoC implementations? apache/arrow#48431
cpp/src/parquet/metadata3.h
Outdated
| // Returns the size of the flatbuffer if found (and writes to out_flatbuffer), | ||
| // returns 0 if no flatbuffer extension is present, or returns the required | ||
| // buffer size if the input buffer is too small. | ||
| ::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer); |
There was a problem hiding this comment.
Since FileMetaData::Make takes uint32_t as metadata_len it might make sense to return it here?
| ::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer); | |
| ::arrow::Result<uint32_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer); |
| #include "arrow/result.h" | ||
| #include "flatbuffers/flatbuffers.h" | ||
| #include "generated/parquet3_generated.h" | ||
| #include "generated/parquet_types.h" |
There was a problem hiding this comment.
Is metadata3.h meant to be public? If so this will make generated thrift header public as well. Perhaps we could introduce MakeFromFlatbuffer in metadata.h/cc instead so we can use it in file_reader.cc:457.
static std::shared_ptr<FileMetaData> MakeFromFlatbuffer(
const uint8_t* flatbuffer_data,
size_t flatbuffer_size,
uint32_t metadata_len,
const ReaderProperties& properties = default_reader_properties());There was a problem hiding this comment.
Some effort is made to not make thrift structs public, I think we should take the same approach with Flatbuffer.
|
FYI @emkornfield @prtkgaur if you want to take a look |
cpp/src/parquet/file_reader.cc
Outdated
| format3::GetFileMetaData(flatbuffer_data.data()); | ||
| auto thrift_metadata = | ||
| std::make_unique<format::FileMetaData>(FromFlatbuffer(fb_metadata)); | ||
| file_metadata_ = FileMetaData::Make( |
There was a problem hiding this comment.
FileMetadata is already a wrapper around thrift, is there a reason we don't have a different implementation that is made purely from the FileMetadata?
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
There was a problem hiding this comment.
I left comments on the PR for the FBS file in parquet-format, we should resync after those are adressed.
cpp/src/parquet/properties.h
Outdated
| void set_footer_read_size(size_t size) { footer_read_size_ = size; } | ||
| size_t footer_read_size() const { return footer_read_size_; } | ||
|
|
||
| // If enabled, try to read the metadata3 footer from the file. |
There was a problem hiding this comment.
| // If enabled, try to read the metadata3 footer from the file. | |
| // If enabled, try to read the flatbuffer metadata footer from the file as an extension (i.e. a PAR1 file). |
| // If it fails, fall back to Thrift footer decoding. | ||
| bool read_metadata3() const { return read_metadata3_; } | ||
| void set_read_metadata3(bool read_metadata3) { read_metadata3_ = read_metadata3; } | ||
|
|
There was a problem hiding this comment.
I guess we need to finalize PAR2 or PAR3 footer to be able to write this out without extension, I think that can be follow-up work but it would be nice to do this as part of the FBS work to ensure we can eventually move away from thrift.
cpp/src/parquet/properties.h
Outdated
|
|
||
| // If enabled, try to read the metadata3 footer from the file. | ||
| // If it fails, fall back to Thrift footer decoding. | ||
| bool read_metadata3() const { return read_metadata3_; } |
There was a problem hiding this comment.
| bool read_metadata3() const { return read_metadata3_; } | |
| bool read_flatbuffer_metadata_if_present() const { return read_metadata3_; } |
cpp/src/parquet/properties.h
Outdated
| bool page_checksum_verification_ = false; | ||
| // Used with a RecordReader. | ||
| bool read_dense_for_nullable_ = false; | ||
| bool read_metadata3_ = false; |
There was a problem hiding this comment.
I think this should default to true? otherwise I worry about readers getting the benefit?
cpp/src/parquet/metadata3.cc
Outdated
| LZ4_RAW = 7, | ||
| }; | ||
|
|
||
| auto GetNumChildren( |
There was a problem hiding this comment.
style nit: Is auto needed here, generally we wouldn't use it unless it was needed for templating, etc.
|
|
||
| auto GetName(const std::vector<format::SchemaElement>& s, size_t i) { return s[i].name; } | ||
|
|
||
| class ColumnMap { |
| BuildParents(s); | ||
| } | ||
|
|
||
| size_t ToSchema(size_t cc_idx) const { return colchunk2schema_[cc_idx]; } |
| std::vector<uint32_t> parents_; | ||
| }; | ||
|
|
||
| struct MinMax { |
cpp/src/parquet/metadata3.cc
Outdated
| uint8_t* const p = reinterpret_cast<uint8_t*>(out.data()) + n + 1; | ||
|
|
||
| // Compute and store checksums and lengths | ||
| uint32_t crc32 = ::arrow::internal::crc32(0, reinterpret_cast<const uint8_t*>(out.data()), n + 1); |
There was a problem hiding this comment.
Is this format documented (I might have missed it in the parquet-format pull request).
cpp/src/parquet/metadata3.cc
Outdated
| } while (true); | ||
| } | ||
|
|
||
| inline uint32_t CountLeadingZeros32(uint32_t v) { |
cpp/src/parquet/metadata3.cc
Outdated
| return out; | ||
| } | ||
|
|
||
| inline uint8_t* WriteULEB64(uint64_t v, uint8_t* out) { |
There was a problem hiding this comment.
we should have something like this for delta binary packed, which uses uleb as well, could you look there?
cpp/src/parquet/metadata3.h
Outdated
| // The extension itself is as follows: | ||
| // | ||
| // +-------------------+------------+--------------------------------------+----------------+---------+--------------------------------+------+ | ||
| // | compress(flatbuf) | compressor | crc(compress(flatbuf) .. compressor) | compressed_len | raw_len | crc(compressed_len .. raw_len) | UUID | |
There was a problem hiding this comment.
This should be documented in the parquet-format PR.
eccaba1 to
7d5b33b
Compare
… instead of relative ones. Implement Geo types. Add Float16 and Variant. Pack statistics better
8c94e80 to
72ac531
Compare
Rationale for this change
Integrate flatbuffers metadata into thrift footer.
The detailed design and experiment doc:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=sharing)
What changes are included in this PR?
Definition of the FlatBuffer footer and the generated FlatBuffer file
To/FromFlatBuffer functions to convert between FlatBuffer and Thrift footer
Append/Extract FlatBuffer to/from the extension field of the Thrift footer
Use append/extract operations based on reader/writer flags
Are these changes tested?
Yes, with newly added tests.
Are there any user-facing changes?
Yes, users can write and read the FlatBuffer footer to speed up footer parsing.