Theory of Operation
The Valkey Search module is an extensible platform for efficiently performing search and processing operations on data stored in Valkey keys. Valkey indexes are defined to “cover” a portion of the Valkey keyspace. Any change to a key within a covered portion of the keyspace automatically updates the associated index. Query operations, composed from a palette of data-type-specific search operators, are executed against the index to efficiently locate keys within the covered keyspace. Once located, substantial server-side processing of the contents of the located keys can be performed. The Search module allows any number of indexes to be created, and their corresponding keyspace coverage can have any arbitrary overlap. Except as noted, all further discussion in this document is on a per-index basis, but it should be understood that each index generally operates independently.
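The sketch below shows keyspace coverage and querying from a client's point of view, assuming a valkey-py style client; the index name, key prefix, field names, and connection details are made up, and the exact FT.CREATE/FT.SEARCH argument grammar shown here should be checked against the command reference rather than taken from this example.

```python
import struct

import valkey  # assumed valkey-py style client; any client with execute_command works

client = valkey.Valkey(host="localhost", port=6379)

# Define an index that "covers" every HASH key starting with "product:".
# From now on, any write to a covered key automatically updates this index.
client.execute_command(
    "FT.CREATE", "product_idx", "ON", "HASH", "PREFIX", "1", "product:",
    "SCHEMA",
    "category", "TAG",
    "embedding", "VECTOR", "HNSW", "6",
    "TYPE", "FLOAT32", "DIM", "4", "DISTANCE_METRIC", "COSINE",
)

# Writing a covered key triggers an index update; keys outside the
# "product:" prefix have no effect on this index.
client.hset("product:1", mapping={
    "category": "shoes",
    "embedding": struct.pack("<4f", 0.1, 0.2, 0.3, 0.4),
})

# Queries run against the index rather than scanning the keyspace.
print(client.execute_command(
    "FT.SEARCH", "product_idx", "*=>[KNN 2 @embedding $vec]",
    "PARAMS", "2", "vec", struct.pack("<4f", 0.1, 0.2, 0.3, 0.4),
))
```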
In CME (cluster mode enabled), the search module allows indexes to span nodes and performs any necessary cross-node operations automatically. The search module handles this situation by constructing an index on each shard with the keys local to that shard. Index updates due to mutations are handled purely locally, i.e., a key modification results in the update of the local index-shard and has no effect on any other cluster member. Conversely, query operations are broadcast to all shards in the cluster (to either the primary or a replica of each shard) and their intermediate results are combined into a single result for the query operation. The code to broadcast and integrate is present on each cluster member, allowing query operations to be performed by any cluster node. Maximum throughput requires that queries be evenly distributed across the cluster.
One consequence of this distribution technique is that horizontal scaling will only increase the ingestion rate, not the query rate. Increasing the query throughput requires vertical and/or replica scaling. Substantial development effort will be required to modify the existing Skyhook scaling policies and mechanism to handle this new scaling paradigm. Modifications of the auto-scaling service for instance-based clusters may also be desirable.
While not planned for implementation in the initial version, this architecture supports the use of Valkey’s slot-tag notation to confine the keyspace for an index to a single slot. This type of index would scale more like typical Valkey workloads in that horizontal scaling would increase both the ingestion and query rate — provided that operations could be distributed across multiple indexes (i.e., no hot-slots).
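For illustration (key names and tags are made up), Valkey's hash-tag rule means that an index whose covered key names all embed the same {tag} stays within a single slot, and therefore on a single shard. The toy slot computation below follows the standard CRC16-mod-16384 cluster rule.

```python
def crc16_xmodem(data: bytes) -> int:
    # CRC-16/XMODEM as used for cluster slot hashing.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # If the key contains "{...}" with at least one character inside,
    # only that substring is hashed, so such keys share a slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# All of these land in the same slot, so an index defined with
# PREFIX 1 "{tenant42}:doc:" would be confined to a single shard.
print({key_slot(k) for k in ["{tenant42}:doc:1", "{tenant42}:doc:2", "{tenant42}:doc:3"]})
```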
Architecturally, the search module divides into three subsystems (ingestion, query, and metadata) that all share a common data structure, the index. Both the ingestion and query subsystems are multi-threaded and thus require a policy for sharing the common index. Access to the shared index is granted on a per-subsystem basis, i.e., at any instant in time the shared index is exclusively used to perform either query, ingestion, or maintenance operations. Machinery is provided to administratively control the allocation of CPU time among these subsystems.
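A minimal sketch of this per-subsystem access policy, assuming nothing about the module's actual C++ implementation: one subsystem at a time owns the shared index, any number of that subsystem's threads may hold it concurrently, and the time-slicing machinery that apportions CPU between subsystems is omitted.

```python
import threading

class IndexGate:
    """Grants the shared index to one subsystem ("ingestion", "query", or
    "maintenance") at a time; many threads of the owning subsystem may
    hold the gate concurrently."""

    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None   # name of the subsystem currently using the index
        self._holders = 0    # number of threads of that subsystem inside

    def acquire(self, subsystem):
        with self._cond:
            while self._owner not in (None, subsystem):
                self._cond.wait()
            self._owner = subsystem
            self._holders += 1

    def release(self, subsystem):
        with self._cond:
            assert self._owner == subsystem
            self._holders -= 1
            if self._holders == 0:
                self._owner = None       # hand the index to the next subsystem
                self._cond.notify_all()
```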
Ingestion operations are handled through a keyspace notification callback. The callback extracts any indexed fields and queues them for processing into the shared index. On a primary, the client is blocked until the index update is completed, providing the application with read-after-write consistency (in CME, this consistency exists only for the current shard, not the entire database). Thus, to obtain maximum ingestion throughput, an application must use enough parallel connections (not pipelining within a single connection) to saturate the index update process.
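For example, a bulk loader under these assumptions fans writes out over several independent connections, since each connection's client blocks until its key has been indexed. The connection details, key names, and worker count below are made up, and a valkey-py style client is assumed.

```python
import threading

import valkey  # assumed valkey-py style client

# Hypothetical workload: HASH keys covered by an index defined with PREFIX "product:".
docs = [(f"product:{i}", {"category": "shoes", "price": str(i)}) for i in range(1000)]

def writer(chunk):
    # One connection per worker thread: each HSET blocks until the key is
    # indexed, so a single connection (even a pipelined one) cannot keep
    # the index-update machinery busy on its own.
    client = valkey.Valkey(host="localhost", port=6379)
    for key, fields in chunk:
        client.hset(key, mapping=fields)

n_workers = 8
threads = [threading.Thread(target=writer, args=(docs[i::n_workers],))
           for i in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```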
The actual process of updating the indexes is parallelized at the field level. Individual keys are decomposed into their constituent fields with the updating of each per-field index done in parallel. A key-tracking data structure is maintained to ensure the integrity (atomicity) of the overall index as observed by the other subsystems. Specifically, it ensures that access to the indexes is only yielded when whole keys or whole multi/exec/LUA batches have completed their respective field-index updates. The key tracker is also used to drive the unblocking of clients.
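The following is a sketch of such a key tracker (an assumption about its shape, not the module's actual code): per-field updates complete independently, a key counts as done only when its last field-index update lands, and the other subsystems can check for quiescence before taking the index.

```python
import threading

class KeyTracker:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}   # key (or batch id) -> [outstanding updates, on_complete]

    def begin(self, key, field_count, on_complete):
        # A MULTI/EXEC or Lua batch can be tracked the same way, using a
        # batch identifier and the total field count across its keys.
        with self._lock:
            self._pending[key] = [field_count, on_complete]

    def field_done(self, key):
        with self._lock:
            entry = self._pending[key]
            entry[0] -= 1
            if entry[0] > 0:
                return
            del self._pending[key]
            on_complete = entry[1]
        on_complete()   # e.g. unblock the writing client

    def quiescent(self):
        # Other subsystems should only observe the index when no key or
        # batch is partially updated.
        with self._lock:
            return not self._pending
```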
The execution of query commands (FT.SEARCH and FT.AGGREGATE) is broken into four phases (a sketch of the overall flow follows the list).
- The command is validated and a query plan is constructed.
- The index is searched in accordance with the query plan, yielding as output a set of key names and scores (score would be vector distance or BM-25 relevance).
- The Valkey key database is accessed to fetch the contents of keys found in phase two.
- The contents of the keys are processed and the final result is returned to the client.
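A toy, runnable sketch of these four phases; the in-memory index and keyspace below are stand-ins for illustration only, and the scoring is a simple term-weight sum rather than the module's actual vector or BM-25 scoring.

```python
# Stand-in data structures: a term -> {key: weight} index and a key -> hash keyspace.
keyspace = {"doc:1": {"title": "red shoes"}, "doc:2": {"title": "blue hat"}}
index = {"red": {"doc:1": 1.2}, "blue": {"doc:2": 0.9}, "shoes": {"doc:1": 0.7}}

def execute_query(terms):
    # Phase 1: validate the command and build a query plan.
    plan = [t.lower() for t in terms if t]

    # Phase 2: search the index, yielding key names and scores.
    scores = {}
    for term in plan:
        for key, weight in index.get(term, {}).items():
            scores[key] = scores.get(key, 0.0) + weight
    hits = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Phase 3: fetch the contents of the located keys from the keyspace.
    rows = [(key, score, keyspace[key]) for key, score in hits]

    # Phase 4: post-process and build the final reply for the client.
    return [{"key": k, "score": s, **fields} for k, s, fields in rows]

print(execute_query(["red", "shoes"]))
```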
In CMD (cluster mode disabled), the second phase (index search) operates by blocking the client (except for multi/exec/LUA, where the main thread waits) while the indexes are searched in a background thread. Once the index search has completed, execution of the third and fourth phases resumes on the main thread.

In CME, the second and third phases must be performed by each shard in the cluster. At the completion of the first phase, the originating node (aka the coordinating node) packages the query plan and issues a gRPC request to all other shards in the cluster (randomly selecting between primaries and replicas to perform load balancing). Each shard has a gRPC receiver thread that receives the request and submits the query to the local index subsystem. Once the query has completed, phase three is executed on the main thread, fetching the contents of the local keys from phase two. The results from phase three are transmitted back to the coordinating node, which merges the responses from all shards and queues the integrated result to the main thread for the processing of phase four.
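A sketch of this scatter/gather, with ordinary function calls standing in for the gRPC requests and a simple score sort standing in for the real merge logic; the shard representation and row format are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_on_shard(shard, plan):
    # Stand-in for the gRPC request: the chosen node (primary or replica)
    # runs phase two against its local index and phase three against its
    # local keys, then returns its rows.
    node = random.choice(shard["nodes"])
    return node(plan)

def coordinate(plan, shards, top_n):
    # Fan the query plan out to every shard in parallel, then merge.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: run_on_shard(shard, plan), shards))
    merged = [row for part in partials for row in part]
    merged.sort(key=lambda row: row["score"], reverse=True)
    return merged[:top_n]   # handed to phase four on the main thread

# Example: two shards, each represented here by callables returning local rows.
shards = [
    {"nodes": [lambda plan: [{"key": "doc:1", "score": 0.9}]]},
    {"nodes": [lambda plan: [{"key": "doc:7", "score": 0.4}]]},
]
print(coordinate(plan=None, shards=shards, top_n=10))
```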
For query operations, the system prioritizes consistency over availability. If a shard fails its portion of the operation, the coordinating node will repeat that operation until the timeout is reached — at which point the command is terminated with an error.
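Sketched below under assumptions (the retry interval, exception types, and function names are illustrative): the per-shard request is retried until the command's deadline, after which the whole command fails rather than returning partial results.

```python
import time

def query_shard_with_retry(send_request, plan, deadline):
    # send_request is a placeholder for the per-shard gRPC call.
    while True:
        try:
            return send_request(plan)
        except ConnectionError:
            if time.monotonic() >= deadline:
                # Consistency over availability: no partial results are returned.
                raise TimeoutError("shard did not answer before the command deadline")
            time.sleep(0.05)   # brief pause before retrying the same shard
```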

The metadata of the search module is the definition of the indexes. This subsystem is responsible for ensuring that the metadata is consistent across all nodes in a cluster. To provide self-healing and eventual consistency of the metadata, an epoch and a Merkle-tree checksum of all local metadata are periodically broadcast over the cluster bus by each node (every 30 seconds, with jitter). Nodes with older epochs or with checksum disagreements automatically fetch the latest version of the metadata from the broadcasting node via gRPC, using an epoch-based last-writer-wins conflict resolution strategy. Conflicts with the metadata of an existing index will cause that index to be dropped, recreated, and backfilled.
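A sketch of this anti-entropy loop under assumptions: the checksum here is a flat digest rather than a Merkle tree, the transport callbacks are placeholders for the cluster bus and gRPC, and the intervals and jitter are illustrative.

```python
import hashlib
import json
import random
import time

BROADCAST_INTERVAL = 30.0   # seconds, plus jitter

def metadata_checksum(index_definitions):
    # Stand-in for the Merkle-tree checksum: a digest over a canonical
    # serialization of all locally known index definitions.
    return hashlib.sha256(
        json.dumps(index_definitions, sort_keys=True).encode()).hexdigest()

def gossip_loop(local, broadcast):
    # Each node periodically announces its metadata epoch and checksum.
    while True:
        broadcast({"epoch": local["epoch"],
                   "checksum": metadata_checksum(local["indexes"])})
        time.sleep(BROADCAST_INTERVAL + random.uniform(0.0, 5.0))

def on_gossip(local, message, fetch_remote):
    behind = message["epoch"] > local["epoch"]
    diverged = (message["epoch"] == local["epoch"]
                and message["checksum"] != metadata_checksum(local["indexes"]))
    if behind or diverged:
        remote = fetch_remote()   # pull the full metadata from the broadcaster
        if remote["epoch"] >= local["epoch"]:
            # Last-writer-wins by epoch; a changed definition for an existing
            # index implies dropping, recreating, and backfilling that index.
            local.update(remote)
```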
Explicit metadata-mutating commands (FT.CREATE, FT.DROPINDEX, etc.) are validated and applied to the receiving node’s metadata immediately, triggering a cluster-wide consistency event. These commands block the client until either the cluster achieves consistency or a timeout is reached.
If no indexes are defined, then no data from the search module is placed into the RDB file, meaning that the mere presence of the module’s code has no effect on the portability of a generated RDB file. When indexes are defined, the index metadata is always saved. However, the per-field indexes can either be saved or be regenerated at load time from the Valkey keys themselves. Currently, only vector field indexes are saved, as regenerating this field-index type can be quite expensive; the other field types are regenerated at reload time. Going forward, the RDB format is expressly designed to allow the other field types to also save their internal indexes should it become desirable to do so.
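A toy sketch of this persistence rule (the data layout and helper names below are invented, not the module's RDB format): metadata is always written, vector field indexes are written because rebuilding them is expensive, and other field indexes are rebuilt from the keyspace on load.

```python
def save_index(index):
    # Metadata is always persisted; per-field payloads only when rebuilding
    # would be expensive (currently, VECTOR fields).
    payload = {"metadata": index["metadata"], "saved_fields": {}}
    for name, field in index["fields"].items():
        if field["type"] == "VECTOR":
            payload["saved_fields"][name] = field["data"]
    return payload

def load_index(payload, rebuild_field):
    # rebuild_field(name, spec) is a placeholder for regenerating a field
    # index by rescanning the covered Valkey keys.
    fields = {}
    for name, spec in payload["metadata"]["schema"].items():
        if name in payload["saved_fields"]:
            data = payload["saved_fields"][name]   # restored directly
        else:
            data = rebuild_field(name, spec)       # regenerated at load time
        fields[name] = {"type": spec["type"], "data": data}
    return {"metadata": payload["metadata"], "fields": fields}
```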