LLMs have revolutionised natural language understanding and generation. Whether you are building chatbots, drafting emails, or generating code, a trained model runs in the backend to produce outputs from users’ inputs. However, the process of inference is more nuanced than “feed in a prompt and receive text.”
In this article, we will explore the two main stages of autoregressive LLM inference, namely the Prefill phase and the Decode phase. We will also look at practical techniques for controlling output length, mitigating repetitiveness, and leveraging beam search.
By the end, you should have a better technical understanding of how LLMs generate text and how you can steer them towards desired behaviours.
Most recent LLMs (e.g. GPT-style models) are autoregressive, meaning they generate text one token at a time, conditioning on all previously generated tokens.
At a high level, the process has two phases: the Prefill phase, in which the prompt is processed in a single pass, and the Decode phase, in which output tokens are generated one at a time.
Understanding the distinction between these phases is crucial for optimising latency and for implementing decoding strategies that influence output quality.
When you supply a prompt, say, “Once upon a time, in a distant kingdom…”, the model must convert that input text into internal representations. This involves several steps:
The input text is split into units called tokens, and each token is encoded as a number (its token_id) by looking up the corresponding substring in a predefined vocabulary.
Each token_id is then mapped to a dense vector (the token embedding), and a positional embedding is added so the model knows where each token sits in the sequence.
The embedded tokens are processed by a stack of transformer blocks, each comprising a self-attention sub-layer and a feed-forward network, wrapped in residual connections and layer normalisation.
In decoder-only architectures, each transformer layer outputs key and value matrices used in future attention computations. Caching these avoids recomputing attention from scratch at every decode step.
Once the prefill is done, the model holds a “snapshot” of hidden states and caches representing the entire input sequence.
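To make this concrete, here is a minimal prefill sketch using the Hugging Face transformers library and the gpt2 checkpoint (both are illustrative choices, not requirements): the prompt is tokenised, a single forward pass runs over all prompt tokens, and the returned key/value cache is kept for the decode loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Once upon a time, in a distant kingdom"
inputs = tokenizer(prompt, return_tensors="pt")      # token_ids for the prompt

with torch.no_grad():
    # Prefill: one forward pass over the entire prompt.
    # use_cache=True returns the per-layer key/value tensors (the KV cache).
    out = model(**inputs, use_cache=True)

logits = out.logits                      # shape: (batch, prompt_len, vocab_size)
past_key_values = out.past_key_values    # the cached "snapshot" reused while decoding
```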
After prefill, the model enters the autoregressive loop: at each step it feeds in the most recently generated token, attends over the cached keys and values, produces a probability distribution over the vocabulary, and selects the next token to append.
This loop continues until a stop condition is met, most commonly an end-of-sequence token or a maximum token limit.
Greedy sampling simply chooses the token with the highest probability at every step.
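Continuing the prefill sketch above (it reuses model, tokenizer, inputs, logits, and past_key_values from that block), the following hand-rolled decode loop picks the arg-max token at every step and stops on the end-of-sequence token or a token cap. It is a sketch of what a library’s generate routine does internally, not a drop-in replacement for it.

```python
generated = inputs["input_ids"]
next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick after prefill
max_new_tokens = 50                                          # stop condition: token cap

with torch.no_grad():
    for _ in range(max_new_tokens):
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:      # stop condition: EOS
            break
        # Decode step: feed only the newest token; past_key_values carries the rest,
        # so attention over the prefix is not recomputed from scratch.
        out = model(input_ids=next_token,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```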
Unlike greedy sampling, beam search maintains multiple candidate sequences (beams) in parallel, expanding the most promising ones at each step. This lets the model look beyond the single highest-probability next token, which improves coherence at the cost of extra computation.
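If you are using the Hugging Face generate API, beam search is a matter of setting num_beams; the call below reuses the model, tokenizer, and inputs from the earlier sketch, and the parameter names are specific to that library.

```python
# Beam search keeps `num_beams` candidate sequences alive at each step and
# returns the highest-scoring one; greedy decoding is the special case num_beams=1.
beam_output = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,            # number of parallel candidate sequences
    early_stopping=True,    # stop once enough beams have finished with EOS
    do_sample=False,        # pure beam search, no sampling
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```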
To prevent overly long or short generations, users can set maximum token limits and apply length penalties to the generation score.
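With the same generate API, length is typically controlled by a hard cap on new tokens, an optional minimum, and a length penalty applied to beam scores; the exact argument names below are Hugging Face specific.

```python
length_controlled = model.generate(
    **inputs,
    min_new_tokens=20,     # EOS is suppressed until at least 20 new tokens exist
    max_new_tokens=120,    # hard upper limit on generation length
    num_beams=5,
    length_penalty=1.2,    # applied to beam scores: >1.0 favours longer outputs, <1.0 shorter
)
print(tokenizer.decode(length_controlled[0], skip_special_tokens=True))
```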
Models often repeat phrases. A repetition penalty modifies the logits of already-used tokens to discourage reuse.
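A common formulation (popularised by the CTRL paper and used in several libraries) divides a previously generated token’s logit when it is positive and multiplies it when it is negative, so the token loses probability either way. A minimal sketch:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Discourage tokens that already appear in generated_ids.

    logits: (vocab_size,) next-token logits for a single sequence.
    generated_ids: 1-D tensor of token ids produced so far.
    penalty: values > 1.0 penalise reuse; 1.0 disables the penalty.
    """
    logits = logits.clone()
    used = logits[generated_ids]
    # Positive logits shrink, negative logits become more negative:
    # either way the penalised token's probability drops.
    logits[generated_ids] = torch.where(used > 0, used / penalty, used * penalty)
    return logits
```

In a hand-rolled loop like the one above, you would apply this to the last-position logits just before the arg-max; Hugging Face exposes the same idea through the repetition_penalty argument of generate.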
Understanding how LLMs perform inference is vital for building efficient and responsive applications. The prefill and decode phases involve different trade-offs, from latency and memory to sampling and quality. With the right strategies like caching, batching, and careful sampling, you can harness the full potential of LLMs for your application needs.