Span Queries

Use of LLM-based inference is evolving beyond its origins in chat. These days, use cases combine multiple inference calls, tool calls, and database lookups. RAG, agentic AI, and deep research are three examples of these more sophisticated use cases.

The goal of this project is to facilitate optimizations that drastically reduce the cost of inference for RAG, agentic AI, and deep research (by 10x [1]) without harming accuracy. Our approach is to generalize the interface to inference servers via the Span Query.

With span queries, chat becomes a special case of a more general form. To the right is a visualization of a span query for a "judge/generator" pattern (a.k.a. "LLM-as-a-judge").

Learn more about span query syntax and semantics

Getting Started with SPNL

SPNL is a library for creating, optimizing, and tokenizing span queries. The library is surfaced for consumption as:

vLLM image | vLLM patch | CLI image | CLI image with Ollama | Rust crate | Python pip | Playground
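
To pull the library into an existing project, you can install it from crates.io or PyPI; the package names below are assumed to match the repository name (spnl):

cargo add spnl       # Rust crate from crates.io
pip install spnl     # Python package from PyPI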

To kick the tires with SPNL running Ollama:

podman run --rm -it ghcr.io/ibm/spnl-ollama --verbose

This runs a judge/generator email example. You can also point it at a JSON file containing a span query, as sketched below.
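
For example, assuming you have a span query saved locally as my-query.json, and assuming the image's entrypoint forwards its arguments to the spnl CLI (the file name and mount point here are purely illustrative):

podman run --rm -it -v "$PWD/my-query.json:/q.json:Z" ghcr.io/ibm/spnl-ollama /q.json --verbose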

Building SPNL

First, configure your environment for Rust. You can then build the CLI with cargo build -p spnl-cli, which produces ./target/debug/spnl. Adding --release produces an optimized build at ./target/release/spnl.
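
For example:

cargo build -p spnl-cli             # debug build, produces ./target/debug/spnl
cargo build -p spnl-cli --release   # optimized build, produces ./target/release/spnl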

CLI Usage

Usage: spnl [OPTIONS] [FILE]

Arguments:
  [FILE]  File to process

Options:
  -b, --builtin <BUILTIN>
          Builtin to run [env: SPNL_BUILTIN=] [possible values: bulkmap, email, email2, email3, sweagent, gsm8k, rag, spans]
  -m, --model <MODEL>
          Generative Model [env: SPNL_MODEL=] [default: ollama/granite3.3:2b]
  -e, --embedding-model <EMBEDDING_MODEL>
          Embedding Model [env: SPNL_EMBEDDING_MODEL=] [default: ollama/mxbai-embed-large:335m]
  -t, --temperature <TEMPERATURE>
          Temperature [default: 0.5]
  -l, --max-tokens <MAX_TOKENS>
          Max Completion/Generated Tokens [default: 100]
  -n, --n <N>
          Number of candidates to consider [default: 5]
  -k, --chunk-size <CHUNK_SIZE>
          Chunk size
      --vecdb-uri <VECDB_URI>
          Vector DB Url [default: data/spnl]
  -r, --reverse
          Reverse order
      --prepare
          Prepare query
  -p, --prompt <PROMPT>
          Question to pose
  -d, --document <DOCUMENT>
          Document(s) that will augment the question
  -x, --max-aug <MAX_AUG>
          Max augmentations to add to the query [env: SPNL_RAG_MAX_MATCHES=]
      --shuffle
          Randomly shuffle order of fragments
  -i, --indexer <INDEXER>
          The RAG indexing scheme [possible values: simple-embed-retrieve, raptor]
  -s, --show-query
          Re-emit the compiled query
      --time <TIME>
          Report query execution time to stderr [possible values: all, gen, gen1]
  -v, --verbose
          Verbose output
      --dry-run
          Dry run (do not execute query)?
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version
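
For example, two illustrative invocations using the flags listed above (the prompt text, document path, and query file name are placeholders):

./target/debug/spnl -b rag -p "What is a span query?" -d notes.md -x 3 --verbose
./target/debug/spnl my-query.json --show-query --dry-run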

Footnotes

  1. https://arxiv.org/html/2409.15355v5
