Conversation

@VoletiRam (Contributor)

Background

Currently, valkey-benchmark only supports synthetic data generation through placeholders like __rand_int__ and __data__. This limits realistic performance testing since synthetic data doesn't reflect real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our Full-text search work and believe it would benefit other use cases like JSON operations, VSS, and general data modeling.

This change adds structured dataset loading:

  • Support for XML/CSV/TSV file formats.
  • __field:fieldname__ placeholders, replaced with the corresponding fields from the dataset file.
  • Natural content sizes of varying length, taken from the data itself.
  • Mixed placeholder usage, combining dataset fields with the random generators.
  • Automatic field discovery from CSV/TSV headers and XML tags.
  • A --maxdocs option to limit how much of the dataset is loaded.
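For example, a minimal hypothetical products.csv (matching the SET example below) could look like this; the header row drives field discovery, so __field:name__ and __field:category__ resolve to the corresponding columns, and values keep their natural lengths:

name,category,price
Mechanical Keyboard,electronics,89.99
Espresso Grinder,kitchen,145.00
Trail Running Shoes,footwear,120.00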

Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds each command from scratch using valkeyFormatCommandArgv() (see the sketch after this list). This ensures:

  • Zero impact on existing functionality
  • Full support for variable-size content
  • Thread-safe atomic record iteration
  • Compatible with pipelining and threading modes
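A minimal sketch of that separate code path, assuming a parsed record with parallel field-name/value arrays (illustrative only, not the PR's exact code; valkeyFormatCommandArgv() is libvalkey's argv serializer, mirroring hiredis, and the header path below is an assumption):

#include <stdlib.h>
#include <string.h>
#include <valkey/valkey.h> /* assumed header for valkeyFormatCommandArgv() */

typedef struct {
    int nfields;
    const char **names;  /* discovered field names, e.g. {"title", "abstract"} */
    const char **values; /* natural-length values for the current record */
    const size_t *lens;
} dataset_record;

/* Build one RESP command from the template, substituting arguments that
 * are exactly a "__field:NAME__" placeholder. (The real code also handles
 * placeholders embedded inside larger arguments.) */
static long long build_dataset_command(char **target, int argc,
                                       const char **tmpl,
                                       const dataset_record *r) {
    const char **argv = malloc(sizeof(*argv) * argc);
    size_t *argvlen = malloc(sizeof(*argvlen) * argc);
    for (int i = 0; i < argc; i++) {
        argv[i] = tmpl[i];
        argvlen[i] = strlen(tmpl[i]);
        if (strncmp(tmpl[i], "__field:", 8) != 0) continue;
        for (int f = 0; f < r->nfields; f++) {
            size_t n = strlen(r->names[f]);
            if (strncmp(tmpl[i] + 8, r->names[f], n) == 0 &&
                strcmp(tmpl[i] + 8 + n, "__") == 0) {
                argv[i] = r->values[f]; /* variable-size content, no fixed buffer */
                argvlen[i] = r->lens[f];
                break;
            }
        }
    }
    long long len = valkeyFormatCommandArgv(target, argc, argv, argvlen);
    free(argv);
    free(argvlen);
    return len;
}

Usage examples: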
# Strings - Simple key-value with dataset fields
./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__"

# Sets - Unique collections from dataset
./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__"

# XML dataset with document limit
./valkey-benchmark --dataset wiki.xml --xml-root-element doc --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__"

# Mixed placeholders (dataset + random)
./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__

Full-Text Search Benchmarking

# Search hit scenarios (existing terms)
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Search miss scenarios (non-existent terms)  
./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Query variations
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__"
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*"

Test environment:
Instance: AWS c7i.16xlarge, 64 vCPU
Test dataset: 5M+ Wikipedia XML documents, 5.8GB in memory

Configuration            Throughput    CPU Usage  Wall Time  Memory Peak
Single-threaded, P1       93,295 RPS    99%        71.4s      5.8GB
Multi-threaded (10), P1   93,332 RPS   137%        71.5s      5.8GB
Single-threaded, P10     274,499 RPS    96%        36.1s      5.8GB
Multi-threaded (4), P10  344,589 RPS   161%        32.4s      5.8GB
(P = pipeline depth, i.e. the -P option)

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 84.19048% with 83 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.71%. Comparing base (1fd6d71) to head (6ff88c9).

Files with missing lines         Patch %   Lines
src/valkey-benchmark-dataset.c   83.74%    73 Missing ⚠️
src/valkey-benchmark.c           86.84%    10 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2823      +/-   ##
============================================
- Coverage     74.92%   74.71%   -0.21%     
============================================
  Files           129      130       +1     
  Lines         71195    71712     +517     
============================================
+ Hits          53340    53579     +239     
- Misses        17855    18133     +278     
Files with missing lines         Coverage Δ
src/valkey-benchmark.c           62.76% <86.84%> (+1.26%) ⬆️
src/valkey-benchmark-dataset.c   83.74% <83.74%> (ø)

... and 24 files with indirect coverage changes


Fix memory leak in memory reporting

Signed-off-by: Ram Prasad Voleti <[email protected]>
@zuiderkwast linked an issue Nov 11, 2025 that may be closed by this pull request
@JimB123 (Member) left a comment:

Need a full documentation file. Need details on the new file format. Examples.

Maybe seed the file with --help data for the existing cases. This can be an area for future improvement.

Comment on lines 2608 to 2609
" --dataset <file> Path to CSV/TSV/XML dataset file for field placeholder replacement.\n"
" All fields auto-detected with natural content lengths.\n",
@JimB123 (Member):

Have we reached the point where we need some actual documentation for valkey-benchmark?

The benchmark tool has relied on --help for increasingly complex configurations. Now, we're introducing a dataset configuration file with very limited description. Is it likely that developers will understand how to use this without examining the code?

I suggest that a new documentation file be created (benchmark.md) which can provide details and examples for using valkey-benchmark, including details about this new configuration file.

@VoletiRam (Contributor, Author):

Thank you for the suggestion. I agree that --help is not clear enough for most of the configurations the benchmark supports. I will add detailed explanations with examples.

Ram Prasad Voleti added 3 commits November 20, 2025 00:10
Add documentation for valkey-benchmark. Fix and improve field discovery
in XML and validate fields before loading the data. Fix a failing TCL
test on Ubuntu.

Signed-off-by: Ram Prasad Voleti <[email protected]>
Fix memory leak in XML scan

Signed-off-by: Ram Prasad Voleti <[email protected]>
@JimB123 (Member) commented Nov 21, 2025:

Nice. I appreciate the benchmark.md file. This is an excellent start and provides a place for future documentation.

@JimB123 (Member) left a comment:

I really like the new benchmark.md file

Ram Prasad Voleti added 2 commits December 12, 2025 20:38
Load only the fields we care about from the dataset. Parse the fields
from the benchmark command and load only those field values.

Signed-off-by: Ram Prasad Voleti <[email protected]>
Remove warning log for incomplete document

Signed-off-by: Ram Prasad Voleti <[email protected]>
@rainsupreme self-requested a review December 15, 2025 23:58
@rainsupreme (Contributor):

Have we considered putting this dataset-related code in a separate file, maybe valkey-benchmark-datasets.c? It seems like it should be fairly self-contained.

@zuiderkwast added the needs-review (Maintainer attention) label Dec 16, 2025
@rainsupreme (Contributor) left a comment:

Finally finished reviewing - sorry for the delay! I still think having the dataset stuff in a separate file is a good idea. We can write unit tests for all the parsing and replacement stuff too. :)

I don't think the perf concerns are necessarily blockers, but I want to make sure we've thought about it and are reasonably sure it won't be an issue.

@rainsupreme (Contributor):

This doc is great! I noticed --warmup and --duration options weren't documented. Are there other options not covered? I'd like to either document all of them or have a list of undocumented options here. :)

@VoletiRam (Contributor, Author):

Done.

size_t before_len = field_pos - arg;
const char *after_start = field_end + FIELD_SUFFIX_LEN;

sds result = sdsnewlen(arg, before_len);
@rainsupreme (Contributor):

I dunno about the perf here. This is run for every argument in every command sent, and I notice it calls strstr() and sdsnewlen() which would typically have O(n) runtime. One can argue that this is effectively constant runtime if the user is reasonable about argument length, but it's still a lot of work to be doing in an inner loop. Also, this appears to be redundant work. My ideas:

  • The template we're given doesn't change. We could preprocess more and only do the minimum work here for each command sent.
  • There is a lot of allocator work going on here - the caller duplicates the template array for every command plus here we replace some number of those allocations again. The benchmark would run faster if we avoided so many allocations. Perhaps we could allocate a large block of memory that is large enough for the longest command we expect and keep reusing that?

If we don't make this more efficient the user will have to be more careful that the benchmark doesn't get too bottlenecked by valkey-benchmark itself, instead of valkey-server. Users have to do this anyway to some extent, but dataset mode demands much more resources than valkey-benchmark usually requires.
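(A hypothetical sketch of the preprocessing idea — split each template argument around its placeholder once up front, then do only copies per command into a reusable buffer; the names here are illustrative, not from the PR:)

#include <string.h>

/* Precomputed once per template argument, so no strstr()/sdsnewlen()
 * is needed in the per-command inner loop. */
typedef struct {
    const char *prefix; size_t prefix_len; /* literal text before the placeholder */
    int field_index;                       /* -1 if the argument is pure literal */
    const char *suffix; size_t suffix_len; /* literal text after the placeholder */
} arg_template;

/* Fill one argument into a caller-provided buffer (sized once for the
 * longest expected command and reused). Returns bytes written. */
static size_t fill_arg(char *buf, const arg_template *t,
                       const char *fieldval, size_t fieldlen) {
    size_t off = 0;
    memcpy(buf + off, t->prefix, t->prefix_len); off += t->prefix_len;
    if (t->field_index >= 0) {
        memcpy(buf + off, fieldval, fieldlen); off += fieldlen;
    }
    memcpy(buf + off, t->suffix, t->suffix_len); off += t->suffix_len;
    return off;
}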

@VoletiRam (Contributor, Author):

I appreciate the performance concern. I have thought about this, and your suggestions should give a 20-30% performance improvement. The performance with the current change is good enough for full-text search needs and saturates the server as desired. I can follow up with template-array-based preprocessing of commands as a future improvement once we know it is a bottleneck.


test {benchmark: dataset XML with field placeholders} {
# Create test XML dataset matching Wikipedia structure
set xml_data "<doc><title>XML Title 1</title><abstract>XML Abstract 1</abstract><url>http://example1.com</url><links><sublink><anchor>test1</anchor><link>http://test1.com</link></sublink></links></doc>\n<doc><title>XML Title 2</title><abstract>XML Abstract 2</abstract><url>http://example2.com</url><links><sublink><anchor>test2</anchor><link>http://test2.com</link></sublink></links></doc>"
@rainsupreme (Contributor):

Could we add whitespace formatting to this so it's more readable?

@VoletiRam (Contributor, Author):

Done.

Ram Prasad Voleti added 4 commits January 8, 2026 00:40
Separate dataset changes into new file. Add unit test coverage.

Signed-off-by: Ram Prasad Voleti <[email protected]>
Fix build

Signed-off-by: Ram Prasad Voleti <[email protected]>
Fix cmake build issue for latest ubuntu-cmake

Signed-off-by: Ram Prasad Voleti <[email protected]>
@rainsupreme self-requested a review January 12, 2026 21:05
Replace placeholders within the same field. Update documentation

Signed-off-by: Ram Prasad Voleti <[email protected]>
@rainsupreme (Contributor) left a comment:

I had a couple questions about the makefile stuff, but mainly the copyright/license boilerplate headers need to be fixed.

The refactor to a separate file looks good to me though. :)

@@ -23,7 +23,7 @@ add_library(valkeylib STATIC ${VALKEY_SERVER_SRCS})
target_compile_options(valkeylib PRIVATE "${COMPILE_FLAGS}")
target_compile_definitions(valkeylib PRIVATE "${COMPILE_DEFINITIONS}")

add_executable(valkey-unit-tests ${UNIT_TEST_SRCS})
add_executable(valkey-unit-tests ${UNIT_TEST_SRCS} ${CMAKE_SOURCE_DIR}/src/valkey-benchmark-dataset.c)
@rainsupreme (Contributor):

I'm not familiar with cmake, but this seems a bit odd and I'm confused. Shouldn't it be part of a list of files like UNIT_TEST_SRCS or something? And why is it added to valkey-unit-tests but not valkey-benchmark?
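(For illustration, the list-based approach described here might look like the following in CMake — a hypothetical sketch, not the PR's actual change:)

# Append the dataset module to the unit-test source list instead of
# inlining it in add_executable().
list(APPEND UNIT_TEST_SRCS ${CMAKE_SOURCE_DIR}/src/valkey-benchmark-dataset.c)
add_executable(valkey-unit-tests ${UNIT_TEST_SRCS})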

@@ -0,0 +1,115 @@
/* Unit tests for valkey-benchmark dataset module
*
* Copyright (c) 2024, Redis Ltd.
@rainsupreme (Contributor):

we're not redis

@rainsupreme (Contributor):

looks like copyright isn't fixed yet? It's an easy fix but licensing and copyright are important :p

src/Makefile Outdated

# valkey-unit-tests
$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME)
$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME) valkey-benchmark-dataset.o
@rainsupreme (Contributor):

similar to the cmake file, I feel like this should be part of a obj list or something, right? Is there a good reason it's not?

@VoletiRam (Contributor, Author):

For unit testing, we only use APIs in valkey-benchmark-dataset, and we don't need the benchmark binary (or the rest of the objects). This is the minimal addition needed to unit test the dataset module.

@@ -0,0 +1,86 @@
/* Dataset support for valkey-benchmark
*
* Copyright (c) 2009-2012, Redis Ltd.
@rainsupreme (Contributor):

we are not redis

@VoletiRam (Contributor, Author):

Ack.

@@ -0,0 +1,776 @@
/* Dataset support for valkey-benchmark
*
* Copyright (c) 2009-2012, Redis Ltd.
@rainsupreme (Contributor):

we are not redis

@VoletiRam (Contributor, Author):

Ack.

* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
@rainsupreme (Contributor):

This license/copyright boilerplate is unnecessary - you can use a short header like the one in hashtable.c. This applies to all the other files as well.

/*
 * Copyright Valkey Contributors.
 * All rights reserved.
 * SPDX-License-Identifier: BSD 3-Clause
 */

}

static sds getXmlFieldValue(const char *xml_doc, const char *field_name) {
char start_tag_prefix[128], end_tag[128];
@rainsupreme (Contributor):

So there's a limit on field length. Is there input checking for this, and a unit test? If this length comes up in other places, it should be a #define constant.

@VoletiRam (Contributor, Author):

Will increase this limit as well.

return sdsnewlen(content_start, content_len);
}

static int csvDiscoverFields(dataset *ds) {
@rainsupreme (Contributor):

you can use bool return type if you #include <stdbool.h> - this helps to clarify what the return value means. (and then of course use true and false instead of 1 and 0). Best to update the whole file with bool for consistency.
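(A sketch of what that change amounts to, applied to the function above:)

#include <stdbool.h>

/* Returns true on success, false on failure - clearer than 1/0. */
static bool csvDiscoverFields(dataset *ds) {
    if (ds == NULL) return false;
    /* ... header parsing as before ... */
    return true;
}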

@VoletiRam (Contributor, Author):

Ack. Will update.

}

static int scanXmlFields(const char *doc_start, const char *doc_end, dataset *ds, const char *start_root_tag, const char *end_root_tag) {
char field_names[MAX_DATASET_FIELDS][64];
@rainsupreme (Contributor):

Now there's a 64-character size limit here? Feels like magic numbers. I'd prefer to see a central #define for these size limits.

I see other 64-character limits below too.
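(For example — the names and values here are placeholders, not what the PR settled on:)

/* Central size limits for dataset parsing, defined once. */
#define MAX_DATASET_FIELDS 128
#define MAX_FIELD_NAME_LEN 256

char field_names[MAX_DATASET_FIELDS][MAX_FIELD_NAME_LEN];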

@VoletiRam (Contributor, Author):

I want to increase this limit just to be future-proof for large-field testing.

Ram Prasad Voleti added 2 commits January 28, 2026 23:31
Address the comments on the cmake file and on the field size and count
limitations. Update return values to use bool. Increase limits to be
future-proof.

Signed-off-by: Ram Prasad Voleti <[email protected]>
@rainsupreme (Contributor) left a comment:

The new changes look good to me. Just one or two quick fixes away now! :)


zmalloc.o
ENGINE_BENCHMARK_NAME=$(ENGINE_NAME)-benchmark$(PROG_SUFFIX)
# Dataset module shared between benchmark and unit tests
BENCHMARK_DATASET_OBJ=valkey-benchmark-dataset.o
@rainsupreme (Contributor):

I guess I had imagined adding it to ENGINE_BENCHMARK_OBJ, but I see now that it is used in UTs too. I'm not sure a list with a single item makes sense - perhaps we should just add it to both ENGINE_BENCHMARK_OBJ and ENGINE_UNIT_TESTS instead? 🤔
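(That suggestion might look like this in the Makefile — a hypothetical sketch using the variable names quoted above:)

# Add the dataset object to both consumers rather than keeping a
# single-item list.
$(ENGINE_BENCHMARK_NAME): $(ENGINE_BENCHMARK_OBJ) valkey-benchmark-dataset.o
$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME) valkey-benchmark-dataset.o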

Labels: needs-review (Maintainer attention)
Linked issue: [NEW] Add structured dataset support to valkey-benchmark
4 participants