feat: metric batch 2s #7613
base: main
Conversation
Signed-off-by: jeremyhi <[email protected]>
Summary of Changes

Hello @fengjiachun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant performance optimization for Prometheus remote write requests by implementing a new batching mechanism. Instead of processing each incoming metric data point immediately, the system now buffers these requests and flushes them to the storage engine either after a configurable time interval or once a maximum number of rows has been accumulated. This change aims to reduce the overhead associated with frequent small writes, leading to more efficient data ingestion.
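To make the described batching model concrete, here is a minimal Tokio-based sketch of a size- and time-bounded worker loop. It is illustrative only: `Batch`, `batch_worker`, `flush`, and the use of `Vec<String>` for rows are assumptions for the sketch, not the PR's actual types, and the real flush interval and row limit come from configuration.

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time;

/// Illustrative batch buffer; the PR's real types and fields differ.
struct Batch {
    rows: Vec<String>,
}

/// A per-table worker: flush when `max_rows` is reached or `flush_interval` elapses.
async fn batch_worker(
    mut rx: mpsc::Receiver<Vec<String>>,
    max_rows: usize,
    flush_interval: Duration,
) {
    let mut batch = Batch { rows: Vec::new() };
    let mut ticker = time::interval(flush_interval);
    loop {
        tokio::select! {
            maybe_rows = rx.recv() => match maybe_rows {
                Some(rows) => {
                    batch.rows.extend(rows);
                    if batch.rows.len() >= max_rows {
                        flush(std::mem::take(&mut batch.rows)).await;
                    }
                }
                None => {
                    // Channel closed: perform a final flush and exit.
                    flush(std::mem::take(&mut batch.rows)).await;
                    break;
                }
            },
            _ = ticker.tick() => {
                flush(std::mem::take(&mut batch.rows)).await;
            }
        }
    }
}

/// Placeholder for the real write path into the storage engine.
async fn flush(rows: Vec<String>) {
    if !rows.is_empty() {
        println!("flushing {} rows", rows.len());
    }
}
```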
Code Review
This pull request introduces a batching mechanism for Prometheus remote write requests to improve performance. The implementation uses a per-table worker model with batching based on size and time intervals, which is a solid approach. However, I've found a critical issue in how the physical_table is retrieved from the query context, which will likely cause a compilation failure. Additionally, the graceful shutdown implementation could lead to data loss when the server is stopped. I've provided detailed comments on both issues with suggestions for fixes.
```rust
let physical_table = ctx
    .extension(PHYSICAL_TABLE_KEY)
    .unwrap_or(GREPTIME_PHYSICAL_TABLE)
    .to_string();
```
The logic for retrieving `physical_table` from the query context extension appears to be incorrect. The extension is set as a `String` in `http/prom_store.rs`, but this code seems to assume it's a `&str` or has compatible types for `unwrap_or`.

`ctx.extension(PHYSICAL_TABLE_KEY)` will likely return an `Option<&String>`. `unwrap_or` expects a value of the same type (`&String`), but `GREPTIME_PHYSICAL_TABLE` is a `&'static str`. This will lead to a compilation error due to a type mismatch.

The intended logic is likely to use the extension if present, or the default value otherwise. This can be correctly implemented as follows:
```rust
// Current
let physical_table = ctx
    .extension(PHYSICAL_TABLE_KEY)
    .unwrap_or(GREPTIME_PHYSICAL_TABLE)
    .to_string();

// Suggested
let physical_table = ctx
    .extension::<String>(PHYSICAL_TABLE_KEY)
    .cloned()
    .unwrap_or_else(|| GREPTIME_PHYSICAL_TABLE.to_string());
```
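For context, here is a standalone reproduction of the mismatch, under the review's assumption that the extension lookup yields `Option<&String>`; the `extension` helper and the constant below are hypothetical stand-ins, not GreptimeDB's actual `QueryContext` API.

```rust
use std::collections::HashMap;

const GREPTIME_PHYSICAL_TABLE: &str = "greptime_physical_table";

/// Stand-in for the query-context lookup, assumed to yield `Option<&String>`.
fn extension<'a>(extensions: &'a HashMap<String, String>, key: &str) -> Option<&'a String> {
    extensions.get(key)
}

fn main() {
    let extensions = HashMap::new();

    // Does not compile: `unwrap_or` on `Option<&String>` expects a `&String`,
    // but `GREPTIME_PHYSICAL_TABLE` is a `&'static str`.
    //
    // let physical_table = extension(&extensions, "physical_table")
    //     .unwrap_or(GREPTIME_PHYSICAL_TABLE)
    //     .to_string();

    // Compiles: clone the extension when present, otherwise build the default.
    let physical_table = extension(&extensions, "physical_table")
        .cloned()
        .unwrap_or_else(|| GREPTIME_PHYSICAL_TABLE.to_string());

    assert_eq!(physical_table, GREPTIME_PHYSICAL_TABLE);
}
```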
```rust
impl Drop for PendingRowsBatcher {
    fn drop(&mut self) {
        let _ = self.shutdown.send(());
    }
}
```
The current Drop implementation sends a shutdown signal to worker tasks but doesn't wait for them to complete their final flushes. This can lead to data loss during server shutdown, as the main process might exit before the workers have finished writing their pending data.
Additionally, the spawn_flush function creates detached tasks for flushing batches, which are also not tracked. If the server shuts down while these flush tasks are in-flight, they might be terminated prematurely, also resulting in data loss.
To ensure data is not lost on shutdown, I recommend implementing a more robust graceful shutdown mechanism. This would typically involve:
- An `async fn shutdown(&self)` method on `PendingRowsBatcher`.
- Tracking all spawned worker and flush tasks (e.g., using a `tokio::task::JoinSet`; see the sketch after this list).
- The `shutdown` method would first signal all workers to stop, then wait for all worker and flush tasks to complete.
- The server's shutdown sequence would need to be updated to call this new `shutdown` method.
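A minimal sketch of such a shutdown path, assuming a `watch` channel for the stop signal and a `tokio::task::JoinSet` holding the worker and flush tasks; the struct and field names are illustrative, not the PR's actual code.

```rust
use tokio::sync::watch;
use tokio::task::JoinSet;

/// Hypothetical handle; field and type names are illustrative only.
struct PendingRowsBatcher {
    shutdown_tx: watch::Sender<bool>,
    tasks: JoinSet<()>,
}

impl PendingRowsBatcher {
    /// Signal every worker to stop, then wait for all worker and flush tasks
    /// to finish their final writes before returning.
    async fn shutdown(&mut self) {
        // Workers watch this channel, drain their pending batch, and exit.
        let _ = self.shutdown_tx.send(true);
        while let Some(result) = self.tasks.join_next().await {
            if let Err(e) = result {
                eprintln!("batcher task failed during shutdown: {e}");
            }
        }
    }
}
```

Note that `JoinSet::join_next` takes `&mut self`, so a `shutdown(&self)` signature as suggested above would need interior mutability (e.g. a `Mutex` around the `JoinSet`).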
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a43f5f20de
```rust
if batch.row_count == 0 {
    batch.created_at = Some(Instant::now());
    batch.ctx = Some(ctx);
    // ...
}
```
Avoid batching requests with differing query hints
The batch key only includes catalog/schema/physical_table, and the batch context is captured once on the first submit (`batch.ctx = Some(ctx)`). If two remote_write requests target the same physical table but carry different `QueryContext` extensions (e.g., `x-greptime-hint-ttl`, `append_mode`, `skip_wal`), they can be merged into the same batch and flushed using the first request's context. That means later requests' hints are ignored, leading to writes with the wrong TTL/WAL or table options. Consider either including the relevant context options in the batch key or keeping per-request context on a per-chunk basis.
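As an illustration of the first option, a batch key could fold the per-request hints into its hash and equality; `BatchKey` and its fields below are assumptions for the sketch, not the PR's actual type.

```rust
use std::collections::BTreeMap;

/// Illustrative batch key. Including the relevant per-request hints keeps
/// requests with different TTL/WAL/append options out of the same batch.
#[derive(Clone, PartialEq, Eq, Hash)]
struct BatchKey {
    catalog: String,
    schema: String,
    physical_table: String,
    /// Relevant QueryContext extensions, e.g. x-greptime-hint-ttl, append_mode, skip_wal.
    hints: BTreeMap<String, String>,
}
```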
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
An exploration and test of a potential performance optimization, not necessarily to be merged.
PR Checklist
Please convert it to a draft if some of the following conditions are not met.