There is an opportunity here to accelerate sampling using speculative decoding. This is especially promising for two reasons:
- We have access to a fast draft model: the token-level model's logprobs already contain some information about future bytes, as sketched after this list. (We can also easily extend this to predict further into the future.)
- The byte-level process can be extremely predictable. The suffixes of words (for example) often have very low entropy, which makes them perfect targets for speculation.
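For concreteness, here is a minimal sketch of one way the token-level distribution could be marginalized onto bytes to produce a greedy byte draft. The inputs `token_logprobs` and `token_bytes` are hypothetical stand-ins, not the project's actual interface:

```python
import math
from collections import defaultdict

def draft_bytes_from_token_logprobs(
    token_logprobs: dict[int, float],  # token id -> logprob of that token coming next
    token_bytes: dict[int, bytes],     # token id -> the UTF-8 bytes that token decodes to
    max_draft_len: int = 4,
) -> bytes:
    """Greedily extend a byte-level draft using the token-level distribution."""
    draft = b""
    for _ in range(max_draft_len):
        # Probability mass on each candidate next byte, summed over tokens
        # whose byte expansion is consistent with (and longer than) the draft.
        byte_mass: dict[int, float] = defaultdict(float)
        for tok, lp in token_logprobs.items():
            bs = token_bytes[tok]
            if bs.startswith(draft) and len(bs) > len(draft):
                byte_mass[bs[len(draft)]] += math.exp(lp)
        if not byte_mass:
            break  # no token extends the current draft; stop speculating
        draft += bytes([max(byte_mass, key=byte_mass.get)])
    return draft
```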
There are two main ways this could happen:
Explicit
This is the "normal" setting: the sampling loop needs to handle the speculation itself. We have a drafting method which predicts a configurable number of bytes into the future, and a verification function which, according to some sampling parameters, accepts a variable number of those bytes and returns the byte distribution following the last accepted byte.
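As a rough illustration, the explicit loop could look like standard speculative sampling applied to bytes. The helper names `draft_fn` and `target_fn` are placeholders rather than the actual interface, and sampling parameters are assumed to already be folded into the distributions; the key point is that the target distributions for all draft positions come from a single batched query:

```python
import numpy as np

def speculative_step(ctx: bytes, draft_fn, target_fn, k: int = 4) -> bytes:
    """One round of byte-level speculative sampling (sketch).

    draft_fn(ctx, k)      -> (k draft bytes, k draft distributions q)
    target_fn(ctx, draft) -> k+1 target next-byte distributions p, in one batched call
    Distributions are length-256 numpy arrays.
    """
    draft, q = draft_fn(ctx, k)
    p = target_fn(ctx, draft)  # one model call covers every draft position
    out = b""
    for i, b in enumerate(draft):
        # Standard accept rule: keep the draft byte with probability min(1, p[b] / q[b]).
        if np.random.rand() < min(1.0, p[i][b] / max(q[i][b], 1e-12)):
            out += bytes([b])
        else:
            # Rejected: resample from the residual distribution max(p - q, 0).
            residual = np.maximum(p[i] - q[i], 0.0)
            residual /= residual.sum()
            out += bytes([int(np.random.choice(256, p=residual))])
            return out
    # Every draft byte accepted: take one bonus byte from the final distribution.
    out += bytes([int(np.random.choice(256, p=p[k]))])
    return out
```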
Implicit
In this setting, the speculation happens entirely inside the next-byte-distribution call. Each time we get a query for the next-byte distribution, we look at the token tree and select draft bytes several positions into the future. Then, under the assumption that those draft bytes are correct, we can add their evaluation nodes to the tree and query them along with the nodes for the current byte. If the draft is correct, the information required for the following next-byte predictions will already be cached and we can return it without querying the model again.
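A minimal sketch of what that could look like, assuming a hypothetical `model_query` that evaluates a batch of byte prefixes (i.e. tree nodes) in one call and a `draft_fn` that picks draft bytes from the token tree:

```python
import numpy as np

class ImplicitSpeculator:
    """Speculation hidden behind the next-byte-distribution call (sketch)."""

    def __init__(self, model_query, draft_fn, k: int = 4):
        self.model_query = model_query  # list of byte prefixes -> list of next-byte distributions (one batched call)
        self.draft_fn = draft_fn        # byte prefix -> up to k drafted future bytes
        self.k = k
        self.cache: dict[bytes, np.ndarray] = {}

    def next_byte_distribution(self, prefix: bytes) -> np.ndarray:
        # If an earlier speculative query already covered this prefix, the
        # distribution is cached and no model call is needed.
        if prefix in self.cache:
            return self.cache.pop(prefix)
        # Otherwise query the current position together with the positions
        # that assume the drafted bytes are correct, all in one model call.
        draft = self.draft_fn(prefix)[: self.k]
        prefixes = [prefix + draft[:i] for i in range(len(draft) + 1)]
        dists = self.model_query(prefixes)
        for spec_prefix, dist in zip(prefixes[1:], dists[1:]):
            self.cache[spec_prefix] = dist
        return dists[0]
```

If the bytes actually sampled match the draft, subsequent calls hit the cache; entries for mis-predicted branches are simply never looked up (a real implementation would also evict them).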