AI is advancing at a remarkable pace and reshaping technology along the way. The release of ChatGPT marked a pivotal moment, sparking a global conversation about language models that can perform tasks with a proficiency few had previously imagined.
"Artificial Intelligence is not a substitute for human intelligence; it is a tool to amplify human creativity and ingenuity." — Fei-Fei Li
This article explores the kinds of use cases you can build with LLMs and the evaluation strategies you can use to understand their performance.
A quick recap for someone wondering what an LLM is
An LLM (large language model) is a language model with billions of parameters, trained on vast amounts of text data. LLMs excel at both generative tasks, such as writing coherent paragraphs, and predictive tasks, such as filling in missing words in sentences or classifying text.
Applications that integrate LLMs are often referred to as LLM-powered applications; they enable capabilities such as chatbots like ChatGPT, automated content generation, and agentic solutions. When we build such an application, it is critical to ensure it is stable and thoroughly tested.
LLM outputs are non-deterministic: the model can generate different responses to the same input. Given this behaviour, evaluation should focus on understanding both the model's capabilities and its potential risks. For example, if we are building a question-answering system, it is essential to check whether the chatbot gives correct answers. Other evaluation criteria could include: Are the answers helpful? Is the tone appropriate? Is the model hallucinating? Is the chatbot being fair?
There are several ways to evaluate LLMs, including manual evaluation and automated methods. One approach is to assemble a set of domain experts to create a gold-standard dataset and compare the model’s output against the experts’ expectations. However, this process can be costly and may not be feasible when building solutions for certain specialised domains, such as healthcare.
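As a rough illustration, a gold-standard comparison can be as simple as exact-match scoring over a hand-curated set of question and expected-answer pairs. The sketch below assumes a hypothetical ask_model function that wraps whichever LLM is being evaluated; both the tiny dataset and the matching rule are deliberately simplistic.

```python
# Minimal sketch of comparing model outputs against a gold-standard dataset.
# `ask_model` is a hypothetical wrapper around the LLM being evaluated.

gold_dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "How many days are in a leap year?", "expected": "366"},
]

def exact_match_accuracy(ask_model, dataset):
    """Return the fraction of answers that match the expert-provided answer."""
    correct = 0
    for example in dataset:
        answer = ask_model(example["question"]).strip().lower()
        if answer == example["expected"].strip().lower():
            correct += 1
    return correct / len(dataset)

# Example usage with a stubbed model:
# print(exact_match_accuracy(lambda q: "Paris", gold_dataset))
```

Exact match is only a starting point; the metrics discussed below relax it for free-form text.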
Let’s explore different approaches for gathering evaluation datasets and the strategies you can use to assess LLM performance.
Ways to generate evaluation datasets:
By combining all the evaluation datasets, we can group the final evaluation strategy into three main categories:
Group 1: Requires Ground Truth Dataset — Measures Correctness
These evaluation methods are focused on assessing the correctness of the model’s outputs, typically by comparing them to a predefined “ground truth” or expected outcome.
2. Generative Quality Metrics
In generative tasks, such as text generation or machine translation, the model is expected to produce fluent and accurate outputs. Common metrics include BLEU and ROUGE, which score the n-gram overlap between the generated text and one or more reference texts.
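For instance, here is a minimal sketch of scoring a single generated sentence against a reference, assuming the nltk and rouge-score packages are available; the example sentences are made up.

```python
# Sketch of scoring a generated sentence against a reference with BLEU and ROUGE-L.
# Assumes the `nltk` and `rouge-score` packages are installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU compares n-gram overlap between the candidate and reference tokens.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures the longest common subsequence between the two texts.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}, ROUGE-L F1: {rouge_l:.3f}")
```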
3. Semantic Similarity
These methods assess how semantically similar the model-generated output is to the reference output at the embedding level. The most common approach is to measure the cosine similarity between the embeddings of the generated and reference texts. A higher cosine similarity indicates that the two pieces of text are more similar in meaning.
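A minimal sketch of this approach, assuming the sentence-transformers package is installed; the model name is just one common choice, and the example sentences are illustrative.

```python
# Sketch of semantic similarity via embeddings and cosine similarity.
# Assumes the `sentence-transformers` package; "all-MiniLM-L6-v2" is one
# common embedding model, not a requirement.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "The medication should be taken twice a day with food."
reference = "Take the medicine two times daily, with meals."

embeddings = model.encode([generated, reference])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# Values close to 1.0 suggest the two texts carry very similar meaning.
print(f"Cosine similarity: {similarity:.3f}")
```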
4. LLM-as-a-Judge
This method uses an LLM as the judge or evaluator of the candidate LLM's responses, even without a strict ground truth.
For example, the judge can assess an output against predefined guidelines or criteria, such as coherence, relevance, and alignment with the expected output.
Sometimes a larger, more capable LLM is used as the judge to evaluate the candidate LLM's outputs.
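Below is a hedged sketch of what an LLM-as-a-judge call could look like, assuming the openai Python SDK and an API key are available; the judge model name, the 1-5 rubric, and the JSON output format are illustrative assumptions rather than fixed requirements.

```python
# Sketch of LLM-as-a-judge: a (usually stronger) model scores a candidate answer
# against simple criteria. Assumes the `openai` package and an API key; the
# model name and rubric are illustrative choices.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the ANSWER to the QUESTION on a scale of 1-5 for each criterion:
coherence, relevance, and alignment with the EXPECTED answer (if provided).
Respond as JSON, e.g. {{"coherence": 4, "relevance": 5, "alignment": 3}}.

QUESTION: {question}
EXPECTED: {expected}
ANSWER: {answer}
"""

def judge(question: str, answer: str, expected: str = "N/A") -> str:
    """Ask the judge model to score a candidate answer; returns its raw reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; use any capable model you trust
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, expected=expected, answer=answer
            ),
        }],
    )
    return response.choices[0].message.content

# Example usage:
# print(judge("What is the boiling point of water at sea level?", "100 degrees Celsius"))
```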
Group 2: Does Not Require Ground Truth — Measures Relevance
These methods evaluate how relevant, useful, or meaningful the model’s output is, without requiring an exact “correct” answer to compare against.
The focus is on how relevant and usable the model's outputs are in practice.
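Many of these relevance and usability checks can be written as simple functional tests over the raw output. The sketch below assumes the application is expected to return valid JSON with a summary field within a length budget; these specific rules are illustrative assumptions about the application, not universal requirements.

```python
# Sketch of reference-free, functional checks on a model's raw output.
# The rules (valid JSON, required field, length limit) are illustrative
# assumptions about the application being tested.
import json

def check_output(raw_output: str, max_chars: int = 500) -> dict:
    """Run lightweight relevance/usability checks that need no ground truth."""
    results = {
        "is_valid_json": False,
        "has_summary_field": False,
        "within_length_limit": len(raw_output) <= max_chars,
    }
    try:
        parsed = json.loads(raw_output)
        results["is_valid_json"] = True
        results["has_summary_field"] = "summary" in parsed
    except json.JSONDecodeError:
        pass
    return results

# Example usage:
# print(check_output('{"summary": "Patient should fast for 8 hours before the test."}'))
```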
In conclusion, evaluating LLM applications requires careful examination that combines methods assessing both the correctness and the relevance of an LLM's outputs. By using techniques such as ground-truth-based metrics, functional testing, and evaluation by another LLM acting as a judge, we can gain a comprehensive understanding of the model's performance. Thorough evaluation is crucial to ensure that LLMs not only generate accurate outputs but also remain useful, relevant, and reliable in real-world applications.