-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48858: [C++][Parquet] Avoid re-serializing footer for signature verification #48859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cpp/src/parquet/metadata.h
Outdated
| PARQUET_DEPRECATED("Deprecated in 24.0.0. Use the two-argument overload instead.") | ||
| bool VerifySignature(const void* signature); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted to deprecate this but we might also remove it, as it seems this API is meant for internal use? @adamreeve @EnricoMi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I can't see why this would need to be part of the public API. In which case, the new method should probably be private or internal too.
If we remove it and make the new method internal then this could block someone from upgrading if they had a legitimate use case for this. Maybe we should deprecate it and suggest users make an issue if they need this functionality? Also OK by me if you just want to remove it though, it seems unlikely this would be used.
cpp/src/parquet/metadata.cc
Outdated
| bool VerifySignature(std::span<const uint8_t> serialized_metadata, | ||
| std::span<const uint8_t> signature) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this as a method of FileMetaData but the only member it uses is the FileDecryptor, so perhaps this should be moved elsewhere (or made static?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
Hmm, it looks like we need to wait for #48819 for the R failures. |
|
If I remember correctly in rust we verify directly on decrypted buffer (skipping reserialization). |
I'm not surprised that Rust is saner than C++ :-) |
35d6c51 to
e853dee
Compare
|
@github-actions crossbow submit -g cpp |
|
Revision: e853dee Submitted crossbow builds: ursacomputing/crossbow @ actions-f90353001e |
e853dee to
1bc9ff5
Compare
|
@github-actions crossbow submit fuzz |
|
Revision: 1bc9ff5 Submitted crossbow builds: ursacomputing/crossbow @ actions-35c4966c5a
|
|
CI failures are unrelated. |
|
@adamreeve @EnricoMi @wgtmac Are you available to review this? |
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me thanks @pitrou. I don't feel too strongly either way about how to handle the deprecation and method visibility.
Yes, it looks like we started out copying the C++ approach and then switched to encrypting the metadata buffer directly for similar reasons: apache/arrow-rs#7459 (comment) |
cpp/src/parquet/metadata.cc
Outdated
| bool VerifySignature(std::span<const uint8_t> serialized_metadata, | ||
| std::span<const uint8_t> signature) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| auto file_decryptor = std::make_shared<InternalFileDecryptor>( | ||
| file_decryption_properties, file_aad, algo.algorithm, | ||
| file_metadata_->footer_signing_key_metadata(), properties_.memory_pool()); | ||
| // set the InternalFileDecryptor in the metadata as well, as it's used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make verify signature as a static function, we can remove set_file_decryptor as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because the decryptor is passed down to row group and column metadata, and is necessary to decrypt encrypted column metadata.
cpp/src/parquet/metadata.cc
Outdated
| std::shared_ptr<Buffer> encrypted_buffer = | ||
| AllocateBuffer(file_decryptor_->pool(), | ||
| aes_encryptor->CiphertextLength(serialized_metadata.size())); | ||
| int32_t encrypted_len = aes_encryptor->SignedFooterEncrypt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we produce the signature by encrypting the serialized metadata? Can't we compute the signature from the encrypted buffer? Or is the signature a property of the encryption process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, the signature is part of the AEAD process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To quote the spec about plaintext footers:
The footer signing is done by encrypting the serialized FileMetaData structure with the AES GCM algorithm - using a footer signing key, and an AAD constructed according to the instructions of the section 4.4. Only the nonce and GCM tag are stored in the file – as a 28-byte fixed-length array, written right after the footer itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the signature we are talking about is the concatenation of the nonce and the AES-GCM "authentication tag".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we obtain the serialized metadata? Does this come from decrypting the encrypted metadata? In that case, we should not need to compute and compare the signature because decryption should fail in the first place if signature does not match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is plaintext footer mode, so we encrypt the footer (metadata) but only store the signature, not the encrypted metadata. To verify integrity, we encrypt again to compare the signatures. There is no encrypted metadata along with the signature, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this context, the footer (i.e. the serialized metadata) is stored in plaintext, not encrypted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(you're right, when the footer is encrypted, checking the signature is part of the decryption process and we don't need to do it manually)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, looks like we posted our messages at the same time :) So, yes.
1bc9ff5 to
e8fea7e
Compare
|
I've made the new method static and private, and updated the deprecation message to ask users to contact us if they need the deprecated functionality. |
|
@github-actions crossbow submit -g cpp |
|
Revision: e8fea7e Submitted crossbow builds: ursacomputing/crossbow @ actions-d1830a6afa |
Rationale for this change
When reading an encrypted Parquet file with a plaintext footer, the Parquet reader is able to verify footer integrity by comparing the signature in the file with the one computed by encrypting the footer.
However, the way it does this is to first re-serializes the deserialized footer using Thrift. This has several issues:
Reason 3 is what allowed this to be uncovered by OSS-Fuzz (see https://oss-fuzz.com/testcase-detail/4740205688193024).
This PR switches to reusing the original serialized metadata.
Are these changes tested?
Yes, by existing tests and new fuzz regression file.
Are there any user-facing changes?
No.