LLMs have revolutionised natural language understanding and generation. Whether you are building chatbots, drafting emails, or generating code, a trained model runs in the backend to produce outputs from users’ inputs. However, the process of inference is more nuanced than “feed in a prompt and receive text.”
In this article, we will explore the two main stages of autoregressive LLM inference, namely the Prefill phase and the Decode phase. We will also look at practical techniques for controlling output length, mitigating repetitiveness, and leveraging beam search.
By the end, you should have a better technical understanding of how LLMs generate text and how you can steer them towards desired behaviours.
Most recent LLMs (e.g. GPT-style models) are autoregressive, meaning they generate text one token at a time, conditioning on all previously generated tokens.
At a high level, the process has two phases: the Prefill phase, in which the prompt is processed in a single pass, and the Decode phase, in which output tokens are generated one at a time.
Understanding the distinction between these phases is crucial for optimising latency and for implementing decoding strategies that influence output quality.
When you supply a prompt, say, “Once upon a time, in a distant kingdom…”, the model must convert that input text into internal representations. This involves several steps:
The input text is split into units called tokens, and each token is encoded as a number (its token_id) by looking up the corresponding substring in a predefined vocabulary.
Each token_id is then mapped to a dense vector (the token embedding), and a positional embedding is added so the model knows where each token sits in the sequence.
The embedded tokens are processed by a stack of transformer blocks, each comprising a self-attention sub-layer and a feed-forward network, wrapped in residual connections and layer normalisation.
In decoder-only architectures, each transformer layer outputs key and value matrices used in future attention computations. Caching these avoids recomputing attention from scratch at every decode step.
Once the prefill is done, the model holds a “snapshot” of hidden states and caches representing the entire input sequence.
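To make this concrete, here is a minimal prefill sketch using the Hugging Face transformers library and the gpt2 checkpoint (both are illustrative choices, not requirements): the prompt is tokenised, a single forward pass runs over all prompt tokens, and the returned key/value cache is kept for the decode loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Once upon a time, in a distant kingdom"
inputs = tokenizer(prompt, return_tensors="pt")      # token_ids for the prompt

with torch.no_grad():
    # Prefill: one forward pass over the entire prompt.
    # use_cache=True returns the per-layer key/value tensors (the KV cache).
    out = model(**inputs, use_cache=True)

logits = out.logits                      # shape: (batch, prompt_len, vocab_size)
past_key_values = out.past_key_values    # the cached "snapshot" reused while decoding
```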
After prefill, the model enters the autoregressive loop: at each step it feeds in the most recently generated token, attends over the cached keys and values, produces a probability distribution over the vocabulary, and selects the next token to append.
This loop continues until a stop condition is met, most commonly an end-of-sequence token or a maximum token limit.
Greedy sampling simply chooses the token with the highest probability at every step.
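Continuing the prefill sketch above (it reuses model, tokenizer, inputs, logits, and past_key_values from that block), the following hand-rolled decode loop picks the arg-max token at every step and stops on the end-of-sequence token or a token cap. It is a sketch of what a library’s generate routine does internally, not a drop-in replacement for it.

```python
generated = inputs["input_ids"]
next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick after prefill
max_new_tokens = 50                                          # stop condition: token cap

with torch.no_grad():
    for _ in range(max_new_tokens):
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:      # stop condition: EOS
            break
        # Decode step: feed only the newest token; past_key_values carries the rest,
        # so attention over the prefix is not recomputed from scratch.
        out = model(input_ids=next_token,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```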
Unlike greedy sampling, beam search maintains multiple candidate sequences (beams) in parallel, expanding the most promising ones at each step. This lets the model look beyond the single highest-probability next token, which improves coherence at the cost of extra computation.
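If you are using the Hugging Face generate API, beam search is a matter of setting num_beams; the call below reuses the model, tokenizer, and inputs from the earlier sketch, and the parameter names are specific to that library.

```python
# Beam search keeps `num_beams` candidate sequences alive at each step and
# returns the highest-scoring one; greedy decoding is the special case num_beams=1.
beam_output = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,            # number of parallel candidate sequences
    early_stopping=True,    # stop once enough beams have finished with EOS
    do_sample=False,        # pure beam search, no sampling
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```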
To prevent overly long or short generations, users can set maximum token limits and apply length penalties to the generation score.
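With the same generate API, length is typically controlled by a hard cap on new tokens, an optional minimum, and a length penalty applied to beam scores; the exact argument names below are Hugging Face specific.

```python
length_controlled = model.generate(
    **inputs,
    min_new_tokens=20,     # EOS is suppressed until at least 20 new tokens exist
    max_new_tokens=120,    # hard upper limit on generation length
    num_beams=5,
    length_penalty=1.2,    # applied to beam scores: >1.0 favours longer outputs, <1.0 shorter
)
print(tokenizer.decode(length_controlled[0], skip_special_tokens=True))
```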
Models often repeat phrases. A repetition penalty modifies the logits of already-used tokens to discourage reuse.
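A common formulation (popularised by the CTRL paper and used in several libraries) divides a previously generated token’s logit when it is positive and multiplies it when it is negative, so the token loses probability either way. A minimal sketch:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Discourage tokens that already appear in generated_ids.

    logits: (vocab_size,) next-token logits for a single sequence.
    generated_ids: 1-D tensor of token ids produced so far.
    penalty: values > 1.0 penalise reuse; 1.0 disables the penalty.
    """
    logits = logits.clone()
    used = logits[generated_ids]
    # Positive logits shrink, negative logits become more negative:
    # either way the penalised token's probability drops.
    logits[generated_ids] = torch.where(used > 0, used / penalty, used * penalty)
    return logits
```

In a hand-rolled loop like the one above, you would apply this to the last-position logits just before the arg-max; Hugging Face exposes the same idea through the repetition_penalty argument of generate.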
Understanding how LLMs perform inference is vital for building efficient and responsive applications. The prefill and decode phases involve different trade-offs, from latency and memory to sampling and quality. With the right strategies like caching, batching, and careful sampling, you can harness the full potential of LLMs for your application needs.