
LLMs have revolutionised natural language understanding and generation. Whether you are building chatbots, drafting emails, or generating code, a trained model runs in the backend, turning user input into generated text. However, the inference process is more nuanced than “feed in a prompt and receive text.”

In this article, we will explore the two main stages of autoregressive LLM inference, namely the Prefill phase and the Decode phase. I will also delve into practical techniques for controlling output length, mitigating repetitiveness, and leveraging beam search.

By the end, you should have a better technical understanding of how LLMs generate text and how you can steer them towards desired behaviours.

Overview of Autoregressive Inference

Most recent LLMs (e.g. GPT-style models) are autoregressive, meaning they generate text one token at a time, conditioning on all previously generated tokens.

At a high level, the process has two phases:

  1. Prefill (Prompt Encoding): The model processes the user’s prompt in full, producing hidden states (and typically key-value caches) that summarise the prompt’s content.
  2. Decode (Autoregressive Generation): The model enters a loop where, at each step, it takes the most recent token, attends over the cached keys/values of all previous tokens, predicts a distribution over the next token, samples or selects one, appends it to the sequence, and updates the cache.

Understanding the distinction between these phases is crucial for optimising latency and for implementing decoding strategies that influence output quality.

Prefill Phase

When you supply a prompt, say, “Once upon a time, in a distant kingdom…”, the model must convert that input text into internal representations. This involves several steps:

Tokenisation

The input text is split into units called tokens, and each token is mapped to an integer identifier (a token_id) drawn from a predefined vocabulary.
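
As a minimal sketch, here is what tokenisation looks like with the Hugging Face transformers library, using GPT-2’s tokenizer purely for illustration (other models split the same text differently):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer, chosen only as an illustrative example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Once upon a time, in a distant kingdom"
token_ids = tokenizer.encode(text)                   # list of integer token_ids
tokens = tokenizer.convert_ids_to_tokens(token_ids)  # the corresponding substrings

print(list(zip(tokens, token_ids)))
```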

Embedding Lookup

Each token_id is mapped to a dense vector (token embedding), and a positional embedding is added to capture the positional information of each token in the sequence.
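
As a rough sketch of this step, with GPT-2-like sizes used purely for illustration (many recent models use rotary or relative position encodings instead of learned absolute position embeddings):

```python
import torch
import torch.nn as nn

# GPT-2-like dimensions, chosen only for illustration.
vocab_size, max_len, d_model = 50257, 1024, 768

token_embedding = nn.Embedding(vocab_size, d_model)     # one vector per token_id
position_embedding = nn.Embedding(max_len, d_model)     # one vector per position

token_ids = torch.tensor([[101, 2402, 257]])            # toy (batch, seq_len) of placeholder ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)            # (1, seq_len)
hidden = token_embedding(token_ids) + position_embedding(positions)  # (batch, seq_len, d_model)
```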

Attention Mechanism

The embedded tokens are processed by a stack of transformer blocks, each comprising a masked multi-head self-attention layer, a position-wise feed-forward network, residual connections, and layer normalisation. During prefill, attention over all prompt tokens is computed in parallel.
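
To make the attention step concrete, here is a minimal single-head causal self-attention sketch; multi-head splitting, layer norms, and residual connections are omitted, and the weight matrices are assumed to be given:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)   # scaled dot-product
    # Causal mask: each position may only attend to itself and earlier positions.
    mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                      # (batch, seq_len, d_head)
```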

Key-Value Cache Construction

In decoder-only architectures, each transformer layer outputs key and value matrices used in future attention computations. Caching these avoids recomputing attention from scratch at every decode step.
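
A conceptual sketch of such a cache is below; real implementations keep one cache per layer (and per head) and usually pre-allocate memory rather than concatenating:

```python
import torch

class KVCache:
    """Toy key-value cache for a single attention layer."""
    def __init__(self):
        self.k = None   # (batch, seq_len_so_far, d_head)
        self.v = None

    def update(self, new_k, new_v):
        # Prefill passes the whole prompt at once; each decode step appends one token.
        self.k = new_k if self.k is None else torch.cat([self.k, new_k], dim=1)
        self.v = new_v if self.v is None else torch.cat([self.v, new_v], dim=1)
        return self.k, self.v
```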

Once the prefill is done, the model holds a “snapshot” of hidden states and caches representing the entire input sequence.

Practical Considerations

Latency vs. Throughput

Prefill processes all prompt tokens in parallel and is largely compute-bound, while decode produces one token at a time and is largely memory-bandwidth-bound. Time-to-first-token is therefore dominated by prefill, and tokens-per-second by decode.

Batching

Serving systems batch multiple requests together so that the hardware stays busy. This improves overall throughput, but very large batches can increase per-request latency.

Memory Footprint

The key-value cache grows with sequence length, batch size, and the number of layers and attention heads, so long prompts and large batches can quickly exhaust accelerator memory.
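
As a back-of-the-envelope illustration of that footprint (the formula assumes a standard multi-head attention cache in fp16; grouped-query attention and other architectural choices change the numbers):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    # The factor of 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes

# Example: a hypothetical 7B-class model (32 layers, 32 heads, head_dim 128)
# with a 4096-token sequence in fp16 needs about 2 GiB of cache per sequence.
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30, "GiB")
```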

Decode Phase

After prefill, the model enters the autoregressive loop:

  1. Take the hidden state of the last token.
  2. Compute attention over cached key-value pairs.
  3. Predict the distribution over the vocabulary.
  4. Determine the next token (via greedy, top-k, nucleus sampling, or beam search).
  5. Append the token to the sequence.
  6. Update the cache with the new token’s key and value.

This loop continues until a stop condition is met, most commonly an end-of-sequence token or a maximum token limit. A minimal sketch of the full prefill-plus-decode loop is shown below.
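
The sketch uses the Hugging Face transformers library with GPT-2 (chosen purely for illustration) and greedy token selection; production systems use far more optimised kernels, but the structure is the same:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer.encode("Once upon a time, in a distant kingdom", return_tensors="pt")

with torch.no_grad():
    # Prefill: encode the whole prompt once and keep the key-value cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values

    generated = input_ids
    for _ in range(50):                                                # maximum-token limit
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True) # greedy pick
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:                # end-of-sequence stop
            break
        # Decode step: feed only the new token and reuse the cached keys/values.
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values

print(tokenizer.decode(generated[0]))
```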

Output Control Techniques

Greedy Sampling

Greedy decoding simply picks the highest-probability token at every step. It is fast and deterministic, but because it never explores alternatives it can lock the model into bland or repetitive continuations.
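
In code, greedy selection is a single arg-max over the next-token logits (the decode-loop sketch above uses exactly this rule):

```python
import torch

logits = torch.randn(1, 50257)                # stand-in next-token logits over the vocabulary
next_token_id = torch.argmax(logits, dim=-1)  # greedy: always take the most likely token
```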

Beam Search

Unlike greedy decoding, beam search maintains several candidate sequences (beams) in parallel, expanding the most promising ones at each step. This lets the model look beyond the single highest-probability next token, which tends to improve coherence at the cost of extra computation.
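
In the transformers library, beam search is exposed through generate(); a small self-contained sketch, again with GPT-2 purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("Once upon a time, in a distant kingdom", return_tensors="pt")

# Keep 4 beams alive at every step and return the best-scoring completed sequence.
output = model.generate(
    input_ids,
    num_beams=4,
    max_new_tokens=50,
    early_stopping=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```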

Temperature and Top-k/Nucleus Sampling

Instead of always taking the arg-max, you can sample from the predicted distribution. Temperature rescales the logits before the softmax (values below 1 sharpen the distribution, values above 1 flatten it), top-k sampling restricts the choice to the k most likely tokens, and nucleus (top-p) sampling restricts it to the smallest set of tokens whose cumulative probability exceeds p.
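
A minimal sketch combining the three controls on a vector of next-token logits (the default values are illustrative, not recommendations):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    logits = logits / temperature                              # <1 sharpens, >1 flattens
    # Top-k: mask out everything below the k-th largest logit.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # Nucleus (top-p): keep the smallest prefix of tokens whose cumulative probability >= p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    outside_nucleus = torch.cumsum(probs, dim=-1) - probs > top_p
    sorted_logits = sorted_logits.masked_fill(outside_nucleus, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
```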

Length Penalty and Max Tokens

To prevent overly long or short generations, you can set a maximum token limit and, when using beam search, apply a length penalty that normalises each beam’s cumulative score by its length, so the search does not systematically prefer shorter sequences.
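
One common formulation is the length normalisation used in Google’s neural machine translation system, sketched below; libraries expose equivalent knobs, such as max_new_tokens and length_penalty in transformers’ generate():

```python
def length_normalised_score(token_log_probs, alpha=0.6):
    # GNMT-style length penalty: divide the summed log-probability by
    # ((5 + length) / 6) ** alpha so longer beams are not unfairly penalised.
    penalty = ((5 + len(token_log_probs)) / 6) ** alpha
    return sum(token_log_probs) / penalty
```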

Repetition Penalty

Models often repeat phrases. A repetition penalty modifies the logits of already-used tokens to discourage reuse.
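
A sketch in the style of the CTRL repetition penalty (Keskar et al., 2019), which is also roughly how the transformers library implements it: logits of tokens that have already appeared are scaled so they become less likely.

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """logits: (batch, vocab); generated_ids: (batch, seq_len) of tokens already emitted."""
    scores = logits.gather(-1, generated_ids)
    # Positive logits are divided by the penalty, negative ones multiplied,
    # so previously used tokens always become less probable (for penalty > 1).
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated_ids, scores)
```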

Conclusion

Understanding how LLMs perform inference is vital for building efficient and responsive applications. The prefill and decode phases involve different trade-offs, from latency and memory to sampling and quality. With the right strategies like caching, batching, and careful sampling, you can harness the full potential of LLMs for your application needs.
