Conversation

sspaink (Member) commented Dec 3, 2025

Why the changes in this PR are needed?

resolve: #7455

What are the changes in this PR?

This introduces a new trigger mode for the decision log plugin:

decision_logs.reporting.trigger=immediate

The immediate trigger mode will upload events as soon as enough events are received to hit the configured upload limit. If not enough events are received within the configured min-max delay, the events received so far are flushed and uploaded.

For the event buffer type this means that when enough events are received they can be uploaded before the configured min-max delay expires, letting the buffer empty faster and preventing dropped events. For the size buffer uploads can also happen sooner, but regardless of the trigger mode dropped events are less likely there, given the default unlimited size and the fact that events are stored as chunks. In immediate mode both buffer types upload chunks of events as a steady stream, as opposed to multiple chunks uploaded in bursts.

Notes to assist PR review:

A contrived example to help demonstrate the benefit, using a small buffer size limit and a long delay time:

Set up the following config (opa-conf.yaml):

services:
  logeater:
    url: http://localhost:8080

status:
  console: true

decision_logs:
  service: logeater
  reporting:
    buffer_type: event
    trigger: periodic
    buffer_size_limit_events: 100
    min_delay_seconds: 10
    max_delay_seconds: 20

Have this simple Rego file (example.rego):

package example

allow if {
    true
}
1. Run OPA: ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego
2. Run the logeater service (just a service to receive the logs): go run main.go (a minimal sketch of such a service is shown after this list).
3. Attack OPA with 5000 events: echo 'POST http://localhost:8181/v1/data/example/allow' | vegeta attack --duration=10s -rate=500 | tee results.bin | vegeta report
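
The logeater in step 2 is just a stand-in sink for decision log uploads. Below is a minimal, hypothetical sketch of such a service, assuming OPA's default behavior of POSTing gzip-compressed JSON arrays of decision log events to the service's /logs path; it is only an illustration, not the actual logeater code.

package main

import (
    "compress/gzip"
    "encoding/json"
    "log"
    "net/http"
)

// A bare-bones decision log sink: decompress the gzip body, decode the JSON
// array of events, and report how many events arrived in each upload.
func main() {
    http.HandleFunc("/logs", func(w http.ResponseWriter, r *http.Request) {
        gz, err := gzip.NewReader(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        defer gz.Close()

        var events []map[string]any
        if err := json.NewDecoder(gz).Decode(&events); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        log.Printf("received %d decision log events", len(events))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}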

Now if you check http://localhost:8181/v1/status you will see a shocking metric: counter_decision_logs_dropped_buffer_size_limit_exceeded with a value of 4800. This is because vegeta sends 500 requests per second for 10 seconds (5000 events total), but the buffer only managed to upload 100 events; another 100 are still sitting in the buffer, and the remaining 4800 were dropped.

Now if you update the config to use the new trigger mode (trigger: immediate) and check /v1/status again, you will see that no events were dropped! You do see some other fun metrics from the encoder attempting to adjust its guessed uncompressed limit:

{
    "counter_enc_uncompressed_limit_scale_down": 7,
    "counter_enc_uncompressed_limit_scale_up": 10
}

These metrics didn't show up in periodic mode because they are reported by the encoder, which wasn't run often enough to scale the uncompressed limit.
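
For context, the rough idea behind these counters, shown here as a simplified sketch of the concept with made-up scaling factors rather than OPA's actual encoder code: the encoder cannot know in advance how well a chunk will compress, so it keeps a guessed budget of uncompressed bytes per chunk and nudges it up or down based on how the compressed chunk compares to the configured upload limit.

package encsketch

// adjustLimit sketches the adaptive behavior behind the scale_up/scale_down
// counters: compare the compressed chunk size to the upload limit and grow
// or shrink the guessed uncompressed budget accordingly.
func adjustLimit(uncompressedLimit, compressedSize, uploadLimit int64) int64 {
    switch {
    case compressedSize > uploadLimit:
        // The chunk came out too big; the guess was too optimistic, scale down.
        return uncompressedLimit * 9 / 10
    case compressedSize < uploadLimit*9/10:
        // Plenty of headroom left; scale up to pack future chunks fuller.
        return uncompressedLimit * 11 / 10
    default:
        // Close enough; keep the current guess.
        return uncompressedLimit
    }
}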

Data

Attacking OPA for 30 seconds with different buffer type and trigger combinations also illustrates what I described above: periodic mode uploads in bursts, while immediate mode uploads as a steady stream. The encoder also seems to stabilize its guess for the uncompressed limit sooner in immediate mode with the event buffer. With the default size limits, no events are dropped.

I used an updated logeater service that produces a graph (code here).

Event, Immediate

buffer_type: event
trigger: immediate
min_delay_seconds: 10
max_delay_seconds: 20

Average Duration between uploads: 970.387093ms
Max Duration between uploads: 983.76925ms

[graph: uploads over time]

Event, Periodic

buffer_type: event
trigger: periodic
min_delay_seconds: 10
max_delay_seconds: 20

Average Duration between uploads: 593.645934ms
Max Duration between uploads: 16.326999458s

[graph: uploads over time]

Size, Immediate

buffer_type: size
trigger: immediate
min_delay_seconds: 10
max_delay_seconds: 20

Average Duration between uploads: 974.381334ms
Max Duration between uploads: 3.96508625s

[graph: uploads over time]

Size, Periodic

Dropped chunks of events, and gaps between uploads

buffer_type: size
trigger: periodic
min_delay_seconds: 10
max_delay_seconds: 20

Average Duration between uploads: 439.323843ms
Max Duration between uploads: 12.64677575s

[graph: uploads over time]

Signed-off-by: Sebastian Spaink <sebastianspaink@gmail.com>

sspaink and others added 5 commits December 3, 2025 18:01
sspaink marked this pull request as ready for review December 5, 2025 00:23

sspaink and others added 3 commits December 15, 2025 11:12

johanfylling (Contributor) left a comment:

Thanks!

Some thoughts/questions.

    retry++
} else {
    retry = 0
    timer := time.NewTimer(delay)
Contributor:

We're not resetting the timer when a flush has been triggered in immediate mode? I.e. if we have a fraction of the timer delay left, and a new log event has triggered an upload, that fraction will be added on top of the next timer delay?

Member Author (sspaink):

Correct, that is the case. I struggled to implement a solution that would reset this timer consistently on immediate upload. I tried using a new channel to reset the timer, but the case where the timer fires before the channel can send the upload required a mutex, which made things even more complicated. So I decided the added time in this scenario was acceptable, given that events should usually be uploaded immediately rather than relying on the timer.
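
For readers not steeped in Go's timer semantics, the awkwardness described here comes from the documented stop-drain-reset requirement on time.Timer. Below is a generic sketch, with a hypothetical helper name and not this PR's code, of what a safe reset looks like when a single goroutine owns the timer; once a second code path also needs to trigger resets while racing with the timer's own select case, the extra channel and mutex mentioned above start to creep in.

package timersketch

import "time"

// resetSafely restarts t with delay d. Per the time.Timer documentation,
// Reset should only be called on a timer that has been stopped or has fired
// and whose channel has been drained; otherwise a stale expiry may be read
// on the next receive. This is only straightforward when the caller is the
// sole reader of t.C, which is exactly what the wait loop discussed above
// cannot guarantee without extra synchronization.
func resetSafely(t *time.Timer, d time.Duration) {
    if !t.Stop() {
        // The timer already fired; drain the channel so Reset starts clean.
        select {
        case <-t.C:
        default:
        }
    }
    t.Reset(d)
}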


johanfylling (Contributor) left a comment:

This solution should work, I think 👍.

Is the reconfiguration concern warranted, and could something be done about it if so?

johanfylling (Contributor) left a comment:

Thanks!

case item := <-b.buffer:
    b.immediateRead(ctx, item)
case done := <-b.stop:
    b.flush(ctx)
Contributor:

Is this flush necessary, considering we expect the outer plugin to immediately call Flush() on the buffer anyways?

Contributor:

And when reconfiguring the same buffer, Reconfigure() will move events between buffers, anyways, right?

Member Author (sspaink):

Moving events between the buffers made it possible to remove this; deleted 👍

}

- func (b *sizeBuffer) Reconfigure(bufferSizeLimitBytes int64, uploadSizeLimitBytes int64, maxDecisionsPerSecond *float64) {
+ func (b *sizeBuffer) Reconfigure(
Contributor:

Complexity-wise, I wonder if we really need this buffer-specific reconfigure 🤔. If the plugin always tore down the old, flushed, and set up a new buffer on config change, we'd have one less edge-case to worry about.
That might come with some hit to performance, but config changes are rare enough that I'm not sure that's a big concern.
On the other hand, if we keep redesigning this, we'll never get it done 😄. Fine to leave it as-is.

Member Author (sspaink):

Removing that edge case is much better! Getting rid of any complexity in this code is a huge win. I can't imagine people are reconfiguring often while OPA is running.

I removed the individual Reconfigure methods; the plugin now always creates a new instance of the buffer.
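
A rough sketch of that simplification, using hypothetical names rather than the plugin's real types: on a config change the plugin builds a fresh buffer from the new settings and moves any pending events over, so the buffer implementations no longer need their own Reconfigure methods.

package buffersketch

// event stands in for a decision log event; buffer is the minimal surface
// this sketch needs. Both are illustrative, not the plugin's actual types.
type event struct{ payload any }

type buffer interface {
    Push(event)
    Drain() []event
}

type sliceBuffer struct{ events []event }

func (b *sliceBuffer) Push(e event) { b.events = append(b.events, e) }

func (b *sliceBuffer) Drain() []event {
    out := b.events
    b.events = nil
    return out
}

type config struct{ /* buffer type, size limits, trigger, ... */ }

func newBuffer(cfg config) buffer { return &sliceBuffer{} }

type plugin struct{ buffer buffer }

// reconfigure swaps in a new buffer built from the new settings and carries
// over whatever the old buffer still holds, so no events are lost across a
// config change.
func (p *plugin) reconfigure(cfg config) {
    old := p.buffer
    p.buffer = newBuffer(cfg) // always a fresh instance, no in-place mutation
    if old != nil {
        for _, e := range old.Drain() {
            p.buffer.Push(e)
        }
    }
}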

…ng the plugin

Signed-off-by: Sebastian Spaink <sebastianspaink@gmail.com>
…osed and restart main loop

Signed-off-by: Sebastian Spaink <sebastianspaink@gmail.com>


Development

Successfully merging this pull request may close these issues.

Decision log plugin: trigger upload as soon as the buffer limit is reached
