Use of LLM-based inference is evolving beyond its origins in chat. These days, use cases combine multiple inference calls, tool calls, and database lookups. RAG, agentic AI, and deep research are three examples of these more sophisticated use cases.
The goal of this project is to facilitate optimizations that drastically reduce the cost of inference for RAG, agentic AI, and deep research (by 10x) without harming accuracy. Our approach is to generalize the interface to inference servers via the Span Query.
A span query generalizes chat: a chat completion is just a special case of the more general form. To the right is a visualization of a span query for a "judge/generator" pattern (a.k.a. "LLM-as-a-judge").
Learn more about span query syntax and semantics
SPNL is a library for creating, optimizing, and tokenizing span queries. It is available as:
vLLM image | vLLM patch | CLI image | CLI image with Ollama | Rust crate | Python pip | Playground
To kick the tires with SPNL running Ollama:
podman run --rm -it ghcr.io/ibm/spnl-ollama --verbose

This will run a judge/generator email example. You can also point it at a JSON file containing a span query, as sketched below.
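As a concrete starting point, you could write your own query file and hand it to the container. The JSON below is purely illustrative: the field names are invented for this sketch and are not the real span query schema (see the syntax and semantics documentation linked above). The volume mount also assumes the image forwards its arguments to the spnl CLI, which accepts a query file as its FILE argument.

```bash
# ILLUSTRATIVE ONLY: invented field names, not the actual span query schema.
# The intent is to show the judge/generator shape: one generate step whose
# input combines several candidate generations plus a judging instruction.
cat > query.json <<'EOF'
{
  "generate": {
    "model": "ollama/granite3.3:2b",
    "input": [
      { "system": "You are a judge. Pick the best candidate email." },
      { "generate": { "model": "ollama/granite3.3:2b",
                      "input": [{ "user": "Draft a polite payment reminder." }] } },
      { "generate": { "model": "ollama/granite3.3:2b",
                      "input": [{ "user": "Draft a firm payment reminder." }] } },
      { "user": "Reply with the winning email only." }
    ]
  }
}
EOF

# Assumes the image passes its arguments through to the spnl CLI.
podman run --rm -it -v "$PWD/query.json:/q.json:Z" ghcr.io/ibm/spnl-ollama /q.json --verbose
```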
First, configure your environment for Rust. Then you can build the CLI with cargo build -p spnl-cli, which will produce ./target/debug/spnl. Adding --release will produce an optimized build in ./target/release/spnl.
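Once built, you can try one of the bundled examples. The sketch below assumes a local Ollama instance is running with the default model (ollama/granite3.3:2b) pulled; every flag used is documented in the help output that follows.

```bash
# Run the judge/generator email builtin with verbose output and timing.
./target/debug/spnl --builtin email --verbose --time all

# The same builtin with a larger model and more candidates to judge
# (assumes the granite3.3:8b tag is available in your local Ollama).
./target/debug/spnl -b email -m ollama/granite3.3:8b -n 7
```

The full set of options: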
Usage: spnl [OPTIONS] [FILE]
Arguments:
[FILE] File to process
Options:
-b, --builtin <BUILTIN>
Builtin to run [env: SPNL_BUILTIN=] [possible values: bulkmap, email, email2, email3, sweagent, gsm8k, rag, spans]
-m, --model <MODEL>
Generative Model [env: SPNL_MODEL=] [default: ollama/granite3.3:2b]
-e, --embedding-model <EMBEDDING_MODEL>
Embedding Model [env: SPNL_EMBEDDING_MODEL=] [default: ollama/mxbai-embed-large:335m]
-t, --temperature <TEMPERATURE>
Temperature [default: 0.5]
-l, --max-tokens <MAX_TOKENS>
Max Completion/Generated Tokens [default: 100]
-n, --n <N>
Number of candidates to consider [default: 5]
-k, --chunk-size <CHUNK_SIZE>
Chunk size
--vecdb-uri <VECDB_URI>
Vector DB Url [default: data/spnl]
-r, --reverse
Reverse order
--prepare
Prepare query
-p, --prompt <PROMPT>
Question to pose
-d, --document <DOCUMENT>
Document(s) that will augment the question
-x, --max-aug <MAX_AUG>
Max augmentations to add to the query [env: SPNL_RAG_MAX_MATCHES=]
--shuffle
Randomly shuffle order of fragments
-i, --indexer <INDEXER>
The RAG indexing scheme [possible values: simple-embed-retrieve, raptor]
-s, --show-query
Re-emit the compiled query
--time <TIME>
Report query execution time to stderr [possible values: all, gen, gen1]
-v, --verbose
Verbose output
--dry-run
Dry run (do not execute query)?
-h, --help
Print help (see more with '--help')
-V, --version
Print version
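Putting a few of these options together, a RAG-style run might look like the sketch below. The question and document path are placeholders, and the run assumes Ollama is serving both the default generative and embedding models; the flags themselves come straight from the help text above.

```bash
# Ask a question augmented by your own document (paths are placeholders).
./target/debug/spnl --builtin rag \
    --prompt "What is our refund policy?" \
    --document data/policy.txt \
    --indexer simple-embed-retrieve \
    --max-aug 3 \
    --show-query
```

Combining --show-query with --dry-run lets you inspect the compiled query without executing it.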