Add structured datasets loading capability in valkey benchmark #2823
base: unstable
Conversation
Add structured dataset loading capability. Supports XML/CSV/TSV file formats. Use `__field:fieldname__` placeholders to replace the corresponding fields from the dataset file. Supports natural content sizes of varying length. Allows mixed placeholder usage, combining dataset fields with random generators. Enables automatic field discovery from CSV/TSV headers and XML tags. Use `--maxdocs` to limit how much of the dataset is loaded.

Signed-off-by: Ram Prasad Voleti <[email protected]>
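As an illustration of the placeholder syntax described above, a minimal sketch follows. The file name, field names, and values are made up, and the exact invocation should be checked against the tool's `--help` output:

```shell
# A minimal CSV dataset; the header row drives automatic field discovery.
cat > articles.csv <<'EOF'
title,abstract,url
First Post,A short abstract,http://example.com/1
Second Post,Another abstract,http://example.com/2
EOF

# Hypothetical invocation mixing a dataset field with the existing
# __rand_int__ generator (needs a running server, so left commented out):
# valkey-benchmark --dataset articles.csv --maxdocs 1000 -n 1000 \
#     HSET doc:__rand_int__ title "__field:title__" abstract "__field:abstract__"
```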
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

@@             Coverage Diff              @@
##           unstable    #2823      +/-  ##
============================================
- Coverage     74.92%   74.71%    -0.21%
============================================
  Files           129      130        +1
  Lines         71195    71712      +517
============================================
+ Hits          53340    53579      +239
- Misses        17855    18133      +278
Fix memory leak in memory reporting Signed-off-by: Ram Prasad Voleti <[email protected]>
JimB123
left a comment
Need a full documentation file. Need details on the new file format. Examples.
Maybe seed the file with --help data for the existing cases. This can be an area for future improvement.
src/valkey-benchmark.c
Outdated
" --dataset <file> Path to CSV/TSV/XML dataset file for field placeholder replacement.\n"
"                  All fields auto-detected with natural content lengths.\n",
Have we reached the point where we need some actual documentation for valkey-benchmark?
The benchmark tool has relied on --help for increasingly complex configurations. Now, we're introducing a dataset configuration file with very limited description. Is it likely that developers will understand how to use this without examining the code?
I suggest that a new documentation file be created (benchmark.md) which can provide details and examples for using valkey-benchmark, including details about this new configuration file.
Thank you for the suggestion. I agree --help is not clear enough for most of the configurations the benchmark supports. I will add a detailed explanation with examples.
Add documentation for valkey-benchmark. Fix field discovery. Improve field discovery in XML and validate fields before loading the data. Fix failed TCL test on Ubuntu. Signed-off-by: Ram Prasad Voleti <[email protected]>
Signed-off-by: Ram Prasad Voleti <[email protected]>
Fix memory leak in xml scan Signed-off-by: Ram Prasad Voleti <[email protected]>
Nice. I appreciate the benchmark.md file. This is an excellent start and provides a place for future documentation.
JimB123
left a comment
I really like the new benchmark.md file
Load only the fields we care about from the dataset. Parse the fields from the benchmark command and load only those field values from the dataset. Signed-off-by: Ram Prasad Voleti <[email protected]>
Remove warning log for incomplete document Signed-off-by: Ram Prasad Voleti <[email protected]>
Have we considered putting this dataset-related code in a separate file, maybe
rainsupreme
left a comment
Finally finished reviewing - sorry for the delay! I still think having the dataset stuff in a separate file is a good idea. We can write unit tests for all the parsing and replacement stuff too. :)
I don't think the perf concerns are necessarily blockers, but I want to make sure we've thought about it and are reasonably sure it won't be an issue.
This doc is great! I noticed --warmup and --duration options weren't documented. Are there other options not covered? I'd like to either document all of them or have a list of undocumented options here. :)
Done.
src/valkey-benchmark.c
Outdated
size_t before_len = field_pos - arg;
const char *after_start = field_end + FIELD_SUFFIX_LEN;

sds result = sdsnewlen(arg, before_len);
I dunno about the perf here. This is run for every argument in every command sent, and I notice it calls strstr() and sdsnewlen() which would typically have O(n) runtime. One can argue that this is effectively constant runtime if the user is reasonable about argument length, but it's still a lot of work to be doing in an inner loop. Also, this appears to be redundant work. My ideas:
- The template we're given doesn't change. We could preprocess more and only do the minimum work here for each command sent.
- There is a lot of allocator work going on here - the caller duplicates the template array for every command plus here we replace some number of those allocations again. The benchmark would run faster if we avoided so many allocations. Perhaps we could allocate a large block of memory that is large enough for the longest command we expect and keep reusing that?
If we don't make this more efficient the user will have to be more careful that the benchmark doesn't get too bottlenecked by valkey-benchmark itself, instead of valkey-server. Users have to do this anyway to some extent, but dataset mode demands much more resources than valkey-benchmark usually requires.
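The "preprocess the template once" idea above could be sketched roughly like this. The struct and function names are hypothetical, not from the patch; the point is that `strstr()` runs once per template argument at setup, and the per-command hot path is a single copy into a reusable caller-owned buffer:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: split a command argument at its first __field:NAME__ placeholder
 * during setup, so the hot path does no searching and no heap allocation. */
typedef struct {
    char *prefix;     /* text before the placeholder */
    char *field_name; /* NAME from __field:NAME__, or NULL if no placeholder */
    char *suffix;     /* text after the placeholder */
} template_arg;

static char *dup_span(const char *s, size_t n) {
    char *p = malloc(n + 1);
    memcpy(p, s, n);
    p[n] = '\0';
    return p;
}

/* One-time cost per template argument: locate and split the placeholder.
 * Returns 1 if a placeholder was found, 0 otherwise. */
static int template_compile(const char *arg, template_arg *t) {
    t->prefix = t->field_name = t->suffix = NULL;
    const char *start = strstr(arg, "__field:");
    if (start) {
        const char *name = start + strlen("__field:");
        const char *end = strstr(name, "__");
        if (end) {
            t->prefix = dup_span(arg, (size_t)(start - arg));
            t->field_name = dup_span(name, (size_t)(end - name));
            t->suffix = dup_span(end + 2, strlen(end + 2));
            return 1;
        }
    }
    t->prefix = dup_span(arg, strlen(arg));
    return 0;
}

/* Hot path: one snprintf into a reusable buffer, no per-command malloc. */
static size_t template_render(const template_arg *t, const char *value,
                              char *buf, size_t cap) {
    if (!t->field_name) return (size_t)snprintf(buf, cap, "%s", t->prefix);
    return (size_t)snprintf(buf, cap, "%s%s%s", t->prefix, value, t->suffix);
}
```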
I appreciate the concern about performance. I have thought about this, and your suggestions would definitely give a 20-30% performance improvement. The performance with the current change is good enough for full-text search needs and saturates the server as desired. I can follow up with template-array-based pre-processing of commands as a future improvement once we know it is a bottleneck.
test {benchmark: dataset XML with field placeholders} {
    # Create test XML dataset matching Wikipedia structure
    set xml_data "<doc><title>XML Title 1</title><abstract>XML Abstract 1</abstract><url>http://example1.com</url><links><sublink><anchor>test1</anchor><link>http://test1.com</link></sublink></links></doc>\n<doc><title>XML Title 2</title><abstract>XML Abstract 2</abstract><url>http://example2.com</url><links><sublink><anchor>test2</anchor><link>http://test2.com</link></sublink></links></doc>"
Could we add whitespace formatting to this so it's more readable?
Done.
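For illustration, the first document of the test dataset above might look like this with whitespace formatting (hand-reflowed here, not copied from the final patch):

```xml
<doc>
  <title>XML Title 1</title>
  <abstract>XML Abstract 1</abstract>
  <url>http://example1.com</url>
  <links>
    <sublink>
      <anchor>test1</anchor>
      <link>http://test1.com</link>
    </sublink>
  </links>
</doc>
```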
Separate dataset changes into new file. Add unit test coverage. Signed-off-by: Ram Prasad Voleti <[email protected]>
Signed-off-by: Ram Prasad Voleti <[email protected]>
Fix build Signed-off-by: Ram Prasad Voleti <[email protected]>
Fix cmake build issue for latest ubuntu-cmake Signed-off-by: Ram Prasad Voleti <[email protected]>
3c9ff5f to 4562857
Replace placeholders within the same field. Update documentation. Signed-off-by: Ram Prasad Voleti <[email protected]>
153ba26 to d088b22
rainsupreme
left a comment
I had a couple questions about the makefile stuff, but mainly the copyright/license boilerplate headers need to be fixed.
The refactor to a separate file looks good to me though. :)
src/unit/CMakeLists.txt
Outdated
@@ -23,7 +23,7 @@ add_library(valkeylib STATIC ${VALKEY_SERVER_SRCS})
 target_compile_options(valkeylib PRIVATE "${COMPILE_FLAGS}")
 target_compile_definitions(valkeylib PRIVATE "${COMPILE_DEFINITIONS}")

-add_executable(valkey-unit-tests ${UNIT_TEST_SRCS})
+add_executable(valkey-unit-tests ${UNIT_TEST_SRCS} ${CMAKE_SOURCE_DIR}/src/valkey-benchmark-dataset.c)
I'm not familiar with cmake, but this seems a bit odd and I'm confused. Shouldn't it be part of a list of files like UNIT_TEST_SRCS or something? And why is it added to valkey-unit-tests but not valkey-benchmark?
@@ -0,0 +1,115 @@
+/* Unit tests for valkey-benchmark dataset module
+ *
+ * Copyright (c) 2024, Redis Ltd.
we're not redis
looks like copyright isn't fixed yet? It's an easy fix but licensing and copyright are important :p
src/Makefile
Outdated
 # valkey-unit-tests
-$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME)
+$(ENGINE_UNIT_TESTS): $(ENGINE_TEST_OBJ) $(ENGINE_LIB_NAME) valkey-benchmark-dataset.o
Similar to the cmake file, I feel like this should be part of an object list or something, right? Is there a good reason it's not?
For unit testing, we only use APIs in valkey-benchmark-dataset, and we don't need the benchmark binary (or the rest of the objects). This is the minimal addition to add unit testing for the dataset.
src/valkey-benchmark-dataset.h
Outdated
@@ -0,0 +1,86 @@
+/* Dataset support for valkey-benchmark
+ *
+ * Copyright (c) 2009-2012, Redis Ltd.
we are not redis
Ack.
src/valkey-benchmark-dataset.c
Outdated
@@ -0,0 +1,776 @@
+/* Dataset support for valkey-benchmark
+ *
+ * Copyright (c) 2009-2012, Redis Ltd.
we are not redis
Ack.
src/valkey-benchmark-dataset.c
Outdated
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
This license/copyright boilerplate is unnecessary - you can use a short header like the one in hashtable.c. This applies to all the other files as well.
/*
* Copyright Valkey Contributors.
* All rights reserved.
* SPDX-License-Identifier: BSD 3-Clause
*/
src/valkey-benchmark-dataset.c
Outdated
}

static sds getXmlFieldValue(const char *xml_doc, const char *field_name) {
    char start_tag_prefix[128], end_tag[128];
So there's a limit on field length. Is there input checking for this, and a unit test? If this length comes up in other places then it should be a #define constant.
Will increase this limit as well.
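The input-checking concern above can be made concrete with a small sketch. The names, the `DATASET_MAX_TAG_LEN` constant, and the exact behavior are hypothetical, not the patch's implementation; the point is a central #define for the tag-buffer size plus an explicit length check before building the tags:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Central limit for the stack buffers that hold "<field>" / "</field>". */
#define DATASET_MAX_TAG_LEN 256

/* Return a pointer to the text between <field> and </field> in doc and
 * store its length in *len_out. Returns NULL if the tag is missing or
 * the field name would overflow the fixed-size tag buffers. */
static const char *xml_field_value(const char *doc, const char *field,
                                   size_t *len_out) {
    char open_tag[DATASET_MAX_TAG_LEN], close_tag[DATASET_MAX_TAG_LEN];
    /* "</" + field + ">" + NUL must fit; reject oversized field names. */
    if (strlen(field) + 4 > sizeof(close_tag)) return NULL;
    snprintf(open_tag, sizeof(open_tag), "<%s>", field);
    snprintf(close_tag, sizeof(close_tag), "</%s>", field);
    const char *start = strstr(doc, open_tag);
    if (!start) return NULL;
    start += strlen(open_tag);
    const char *end = strstr(start, close_tag);
    if (!end) return NULL;
    *len_out = (size_t)(end - start);
    return start;
}
```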
src/valkey-benchmark-dataset.c
Outdated
    return sdsnewlen(content_start, content_len);
}

static int csvDiscoverFields(dataset *ds) {
you can use bool return type if you #include <stdbool.h> - this helps to clarify what the return value means. (and then of course use true and false instead of 1 and 0). Best to update the whole file with bool for consistency.
Ack. Will update.
src/valkey-benchmark-dataset.c
Outdated
}

static int scanXmlFields(const char *doc_start, const char *doc_end, dataset *ds, const char *start_root_tag, const char *end_root_tag) {
    char field_names[MAX_DATASET_FIELDS][64];
now there's a 64 character size limit here? Feels like magic numbers. I'd prefer to see a central #define for these size limits.
I see other 64 character limits below too
I want to increase this limit just to be future proof for large field testing.
Address the comments on cmake file, field size and count limitations. Update return values to use bool. Increase limits to be future proof. Signed-off-by: Ram Prasad Voleti <[email protected]>
Signed-off-by: Ram Prasad Voleti <[email protected]>
rainsupreme
left a comment
The new changes look good to me. Just one or two quick fixes away now! :)
 zmalloc.o
 ENGINE_BENCHMARK_NAME=$(ENGINE_NAME)-benchmark$(PROG_SUFFIX)
+# Dataset module shared between benchmark and unit tests
+BENCHMARK_DATASET_OBJ=valkey-benchmark-dataset.o
I guess I had imagined adding it to ENGINE_BENCHMARK_OBJ, but I see now that it is used in UTs too. I'm not sure a list with a single item makes sense - perhaps we should just add it to both ENGINE_BENCHMARK_OBJ and ENGINE_UNIT_TESTS instead? 🤔
Background
Currently, valkey-benchmark only supports synthetic data generation through placeholders like `__rand_int__` and `__data__`. This limits realistic performance testing, since synthetic data doesn't reflect the real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our full-text search work and believe it would benefit other use cases like JSON operations, VSS, and general data modeling.

Add structured dataset loading capability. Support XML/CSV/TSV file formats. Use `__field:fieldname__` placeholders to replace the corresponding fields from the dataset file. Support natural content sizes of varying length. Allow mixed placeholder usage combining dataset fields with random generators. Enable automatic field discovery from CSV/TSV headers and XML tags. Use `--maxdocs` to limit the dataset loading.

Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds commands from scratch using `valkeyFormatCommandArgv()`. This ensures:

Full-Text Search Benchmarking
Test environment:
Instance: AWS c7i.16xlarge, 64 vCPU
Test Dataset: 5M+ Wikipedia XML documents, 5.8GB memory