Conversation

@fengjiachun
Collaborator

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

An exploration and test of a potential performance optimization, not necessarily to be merged.

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@fengjiachun fengjiachun requested a review from a team as a code owner January 26, 2026 04:08
@github-actions github-actions bot added size/M docs-not-required This change does not impact docs. labels Jan 26, 2026
@gemini-code-assist

Summary of Changes

Hello @fengjiachun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a significant performance optimization for Prometheus remote write requests by implementing a new batching mechanism. Instead of processing each incoming metric data point immediately, the system now buffers these requests and flushes them to the storage engine either after a configurable time interval or once a maximum number of rows has been accumulated. This change aims to reduce the overhead associated with frequent small writes, leading to more efficient data ingestion.

Highlights

  • New Batching Mechanism: Introduced a PendingRowsBatcher to buffer Prometheus remote write requests, improving performance by flushing data in larger batches rather than individually.
  • Configurable Batching Parameters: Added new configuration options, pending_rows_flush_interval (default 2 seconds) and max_batch_rows (default 100,000), to control the batcher's behavior.
  • Enhanced Error Handling: New error types, BatcherChannelClosed and BatcherQueueFull, were added to handle specific issues related to the batching process, with BatcherQueueFull mapping to a RateLimited status code.
  • New Metrics for Monitoring: Integrated new Prometheus metrics (PENDING_BATCHES, PENDING_ROWS, FLUSH_TOTAL, FLUSH_ROWS) to provide visibility into the batcher's operational status and performance.
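
The two configuration options above can be sketched as a plain config struct. This is a hypothetical illustration of the defaults named in the summary (2 seconds and 100,000 rows), not the PR's actual type:

```rust
use std::time::Duration;

// Hypothetical struct; field names mirror the options named in the
// highlights (`pending_rows_flush_interval`, `max_batch_rows`).
#[derive(Debug, Clone, PartialEq)]
struct PendingRowsBatcherConfig {
    /// Flush buffered rows at least this often.
    pending_rows_flush_interval: Duration,
    /// Flush immediately once this many rows have accumulated.
    max_batch_rows: usize,
}

impl Default for PendingRowsBatcherConfig {
    fn default() -> Self {
        Self {
            pending_rows_flush_interval: Duration::from_secs(2),
            max_batch_rows: 100_000,
        }
    }
}

fn main() {
    let cfg = PendingRowsBatcherConfig::default();
    assert_eq!(cfg.pending_rows_flush_interval, Duration::from_secs(2));
    assert_eq!(cfg.max_batch_rows, 100_000);
}
```

A batch is flushed when either threshold is hit, whichever comes first, which bounds both write latency and per-request overhead.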


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a batching mechanism for Prometheus remote write requests to improve performance. The implementation uses a per-table worker model with batching based on size and time intervals, which is a solid approach. However, I've found a critical issue in how the physical_table is retrieved from the query context, which will likely cause a compilation failure. Additionally, the graceful shutdown implementation could lead to data loss when the server is stopped. I've provided detailed comments on both issues with suggestions for fixes.

Comment on lines +44 to +47

    let physical_table = ctx
        .extension(PHYSICAL_TABLE_KEY)
        .unwrap_or(GREPTIME_PHYSICAL_TABLE)
        .to_string();


critical

The logic for retrieving physical_table from the query context extension appears to be incorrect. The extension is set as a String in http/prom_store.rs, but this code seems to assume it's a &str or has compatible types for unwrap_or.

ctx.extension(PHYSICAL_TABLE_KEY) will likely return an Option<&String>. unwrap_or expects a value of the same type (&String), but GREPTIME_PHYSICAL_TABLE is a &'static str. This will lead to a compilation error due to type mismatch.

The intended logic is likely to use the extension if present, or the default value otherwise. This can be correctly implemented as follows:

Suggested change

    - let physical_table = ctx
    -     .extension(PHYSICAL_TABLE_KEY)
    -     .unwrap_or(GREPTIME_PHYSICAL_TABLE)
    -     .to_string();
    + let physical_table = ctx
    +     .extension::<String>(PHYSICAL_TABLE_KEY)
    +     .cloned()
    +     .unwrap_or_else(|| GREPTIME_PHYSICAL_TABLE.to_string());
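
The `Option<&String>` pattern behind this fix can be exercised standalone. In this sketch a plain `HashMap` stands in for the query context, and the key and constant names are assumptions for illustration only:

```rust
use std::collections::HashMap;

// Assumed default, mirroring GREPTIME_PHYSICAL_TABLE in the PR.
const GREPTIME_PHYSICAL_TABLE: &str = "greptime_physical_table";

// A String-valued extension lookup returns Option<&String>, so the
// &'static str default must be converted via unwrap_or_else rather
// than passed to unwrap_or directly (type mismatch otherwise).
fn physical_table(extensions: &HashMap<&str, String>) -> String {
    extensions
        .get("physical_table") // Option<&String>
        .cloned()              // Option<String>
        .unwrap_or_else(|| GREPTIME_PHYSICAL_TABLE.to_string())
}

fn main() {
    let mut ext = HashMap::new();
    assert_eq!(physical_table(&ext), "greptime_physical_table");
    ext.insert("physical_table", "custom_table".to_string());
    assert_eq!(physical_table(&ext), "custom_table");
}
```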

Comment on lines +137 to +140
impl Drop for PendingRowsBatcher {
fn drop(&mut self) {
let _ = self.shutdown.send(());
}


high

The current Drop implementation sends a shutdown signal to worker tasks but doesn't wait for them to complete their final flushes. This can lead to data loss during server shutdown, as the main process might exit before the workers have finished writing their pending data.

Additionally, the spawn_flush function creates detached tasks for flushing batches, which are also not tracked. If the server shuts down while these flush tasks are in-flight, they might be terminated prematurely, also resulting in data loss.

To ensure data is not lost on shutdown, I recommend implementing a more robust graceful shutdown mechanism. This would typically involve:

  1. An async fn shutdown(&self) method on PendingRowsBatcher.
  2. Tracking all spawned worker and flush tasks (e.g., using a tokio::task::JoinSet).
  3. The shutdown method would first signal all workers to stop, then wait for all worker and flush tasks to complete.
  4. The server's shutdown sequence would need to be updated to call this new shutdown method.
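
The signal-then-wait ordering in the steps above can be sketched with blocking std threads for illustration. The PR itself is async, so a tokio::task::JoinSet would play the role of the Vec<JoinHandle> here, and the flushed-row counts are stand-ins:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for PendingRowsBatcher, showing only the
// shutdown-ordering concern: signal every worker, then join them all
// so final flushes complete before the process exits.
struct Batcher {
    shutdown_txs: Vec<mpsc::Sender<()>>,
    workers: Vec<thread::JoinHandle<usize>>, // rows flushed per worker
}

impl Batcher {
    fn new(n: usize) -> Self {
        let mut shutdown_txs = Vec::new();
        let mut workers = Vec::new();
        for _ in 0..n {
            let (tx, rx) = mpsc::channel::<()>();
            shutdown_txs.push(tx);
            workers.push(thread::spawn(move || {
                // Block until the shutdown signal, then do a final flush.
                let _ = rx.recv();
                let flushed_rows = 42; // stand-in for flushing pending rows
                flushed_rows
            }));
        }
        Self { shutdown_txs, workers }
    }

    /// Signal every worker, then wait for every final flush to finish.
    fn shutdown(self) -> usize {
        for tx in &self.shutdown_txs {
            let _ = tx.send(());
        }
        self.workers.into_iter().map(|h| h.join().unwrap()).sum()
    }
}

fn main() {
    let batcher = Batcher::new(3);
    // All three workers flushed before shutdown() returned.
    assert_eq!(batcher.shutdown(), 126);
}
```

The key difference from the current Drop impl is that shutdown consumes self and joins every handle, so no pending data can be dropped mid-flush.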


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a43f5f20de


Comment on lines +199 to +201
if batch.row_count == 0 {
batch.created_at = Some(Instant::now());
batch.ctx = Some(ctx);


P2: Avoid batching requests with differing query hints

The batch key only includes catalog/schema/physical_table, and the batch context is captured once on the first submit (batch.ctx = Some(ctx)). If two remote_write requests target the same physical table but carry different QueryContext extensions (e.g., x-greptime-hint-ttl, append_mode, skip_wal), they can be merged into the same batch and flushed using the first request’s context. That means later requests’ hints are ignored, leading to writes with the wrong TTL/WAL or table options. Consider either including relevant context options in the batch key or keeping per-request context on a per-chunk basis.
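
One way to follow the first suggestion is to fold the write-affecting context options into the batch key itself, so requests with different hints can never share a batch. The field names below mirror the hints named in the comment but are assumptions about the real types:

```rust
use std::collections::HashMap;

// Hypothetical extended batch key: catalog/schema/physical_table plus
// the context options that change write semantics.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct BatchKey {
    catalog: String,
    schema: String,
    physical_table: String,
    // Options that must not be mixed within one batch:
    ttl: Option<String>,
    append_mode: bool,
    skip_wal: bool,
}

fn main() {
    let mut batches: HashMap<BatchKey, Vec<u64>> = HashMap::new();
    let base = BatchKey {
        catalog: "greptime".into(),
        schema: "public".into(),
        physical_table: "greptime_physical_table".into(),
        ttl: None,
        append_mode: false,
        skip_wal: false,
    };
    // Same physical table, but a TTL hint is present on this request.
    let with_ttl = BatchKey { ttl: Some("7d".into()), ..base.clone() };
    batches.entry(base).or_default().push(1);
    batches.entry(with_ttl).or_default().push(2);
    // Different hints land in two separate batches.
    assert_eq!(batches.len(), 2);
}
```

The trade-off is lower batching efficiency when hints vary widely; the alternative mentioned above (per-chunk contexts inside one batch) keeps larger batches at the cost of a more complex flush path.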

