-
Notifications
You must be signed in to change notification settings - Fork 25
Add S3 file caching via cache_httpfs extension #149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add optional S3 file caching via DuckDB's cache_httpfs extension. This improves query performance for CTEs/subqueries that read the same Parquet files multiple times (5-10x faster in benchmarks). Configuration (disabled by default): [query] enable_s3_cache = true Fixes Basekick-Labs#147
xe-nvdk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing the in-memory cache approach! This addresses the portability concern I raised in Issue #147.
Review Summary
The implementation uses in_memory cache mode which preserves ephemeral compute - good decision.
Required Changes
1. Bug: Config not wired to database
The EnableS3Cache field is added to database.Config but never set in cmd/arc/main.go:
// cmd/arc/main.go:153-169
dbConfig := &database.Config{
MaxConnections: cfg.Database.MaxConnections,
// ... other fields ...
AzureEndpoint: cfg.Storage.AzureEndpoint,
// EnableS3Cache is missing here!
}Fix: Add EnableS3Cache: cfg.Query.EnableS3Cache, to the dbConfig initialization.
2. Add logging for cache setup
When the cache is enabled/disabled, it should be logged:
if cfg.EnableS3Cache {
logger.Info().Msg("Enabling S3 file caching via cache_httpfs extension")
if _, err := db.Exec("INSTALL cache_httpfs FROM community"); err == nil {
// ...
logger.Info().Msg("cache_httpfs extension loaded with in_memory mode")
} else {
logger.Warn().Err(err).Msg("Failed to install cache_httpfs extension, continuing without cache")
}
}3. Add tests
Please add at least a config loading test in internal/config/config_test.go:
func TestQueryConfig(t *testing.T) {
// Test that enable_s3_cache defaults to false
// Test that it can be set to true
}Questions
- Should we also log the
s3_cache_enabledstatus in the DuckDB initialization log message (line ~84-93)? - Should there be a way to configure the cache size? The default is ~128MB.
Let me know if you have questions!
|
Thanks for the review @xe-nvdk! I'll address all the required changes. For your questions:
|
- Add cache_httpfs extension support for faster S3 reads - Add [query] config section with enable_s3_cache, s3_cache_size, and s3_cache_ttl_seconds options - Add logging for cache setup (info on success, warn on failure) - Add config tests for defaults and environment variable overrides
|
Thanks @khalid244! Great work on this feature. Merged and will ship in Arc 2026.02.1. I added a small follow-up commit to log warnings if any of the SET commands fail (edge case, but good to have visibility). Appreciate you using in-memory caching - keeps Arc's stateless compute philosophy intact. |
xe-nvdk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good! Thank you again
|
Thanks for the merge and the follow-up improvement! Excited to see it ship in 2026.02.1. |
Summary
cache_httpfsextension support for faster S3 reads[query]config section withenable_s3_cache,s3_cache_size, ands3_cache_ttl_secondsoptionsTest plan
TestQueryConfig_Defaults,TestQueryConfig_EnvOverride)