Speculative decoding #3

@PythonNut

Description

There is an opportunity here to accelerate sampling using speculative decoding. This is especially promising for two reasons:

  1. We have access to a fast draft model: the token-level model's logprobs, which already contain some information about future bytes! (We can also easily extend this to predict further into the future; see the sketch after this list.)
  2. The byte-level process can be extremely predictable. The suffixes of words (for example) often have very low entropy, which makes them perfect targets for speculation.
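
To make reason 1 concrete, here is a minimal sketch (not the project's API) of how a byte-level draft could be read off the token-level logprobs: marginalize the token probabilities over byte prefixes and keep speculating only while the next byte is nearly certain. The function name, arguments, and the confidence cutoff are all illustrative assumptions.

```python
import math
from collections import defaultdict

def draft_bytes(token_logprobs, token_bytes, max_draft=8, threshold=0.9):
    """Greedily extend a byte-level draft from token-level logprobs.

    token_logprobs: dict of token id -> logprob of that token at the
        current position.
    token_bytes: dict of token id -> the bytes that token decodes to.
    Speculation stops once the next byte is no longer nearly certain
    (mass below `threshold`) or no token extends the draft.
    """
    draft = b""
    for _ in range(max_draft):
        # Probability mass of each candidate next byte, summed over the
        # tokens whose byte expansion is consistent with the draft so far.
        mass = defaultdict(float)
        for tok, lp in token_logprobs.items():
            b = token_bytes[tok]
            if b.startswith(draft) and len(b) > len(draft):
                mass[b[len(draft)]] += math.exp(lp)
        if not mass:
            break
        best_byte, best_mass = max(mass.items(), key=lambda kv: kv[1])
        if best_mass / sum(mass.values()) < threshold:
            break  # next byte is too uncertain to be worth speculating on
        draft += bytes([best_byte])
    return draft
```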

There are two main ways this could happen:

Explicit

This is the "normal" setting: the sampling loop handles the speculation itself. We have a drafting method that predicts a configurable number of bytes into the future, and a verification function that, according to some sampling parameters, accepts a variable number of those bytes (along with the byte distribution following the last accepted byte).
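
A minimal sketch of what that loop could look like, assuming hypothetical `draft`/`verify` methods with the shapes described above (their names and return types are illustrative, not part of the codebase):

```python
import random

def sample_byte(dist):
    # dist: a length-256 list of probabilities over the next byte value.
    return random.choices(range(256), weights=dist, k=1)[0]

def speculative_sample(model, max_bytes, draft_len=8):
    """Explicit speculative sampling loop (illustrative sketch).

    `model.draft(k)` and `model.verify(drafted)` are hypothetical names:
    draft(k) proposes up to k future bytes from the fast token-level
    draft, and verify(drafted) runs the byte-level process, returning
    how many draft bytes were accepted together with the byte
    distribution following the last accepted byte.
    """
    out = bytearray()
    while len(out) < max_bytes:
        drafted = model.draft(draft_len)
        n_accepted, next_dist = model.verify(drafted)
        out += drafted[:n_accepted]
        # Whether the draft was cut short or fully accepted, sample one
        # more byte from the distribution after the last accepted byte.
        out.append(sample_byte(next_dist))
    return bytes(out)
```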

Implicit

In this setting, the speculation happens entirely inside the next-byte distribution call. Each time we get a query for the next-byte distribution, we look at the token tree and select draft bytes extending several positions into the future. Then, under the assumption that those draft bytes are correct, we can add the evaluation nodes to the tree and query them along with the nodes for the current byte. If we are correct, the information required for the following next-byte prediction will already be cached, and we can return it without querying the model again.
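
A sketch of how that could be wired up, with the caller seeing only an ordinary next-byte distribution query; `tree.draft_path` and `model.eval_nodes` are assumed interfaces for drafting from the token tree and for batched node evaluation, not existing functions:

```python
class ImplicitSpeculativeDistribution:
    """Next-byte distributions with speculation hidden inside the call.

    On each query we also evaluate the nodes along a drafted byte path
    (taken from the token tree) in the same batch, and cache their
    distributions. If the caller's subsequent queries follow the draft,
    the answers are served from the cache without touching the model.
    """
    def __init__(self, model, tree, draft_len=4):
        self.model = model
        self.tree = tree
        self.draft_len = draft_len
        self.cache = {}  # maps a byte prefix -> next-byte distribution

    def next_byte_dist(self, prefix: bytes):
        if prefix in self.cache:
            return self.cache.pop(prefix)  # speculation paid off
        # Draft several bytes past `prefix` from the token tree, and
        # evaluate the current node plus the draft nodes in one batch.
        draft = self.tree.draft_path(prefix, self.draft_len)
        nodes = [prefix] + [prefix + draft[: i + 1] for i in range(len(draft))]
        dists = self.model.eval_nodes(nodes)  # single batched model call
        for node, dist in zip(nodes[1:], dists[1:]):
            self.cache[node] = dist
        return dists[0]
```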
