In the rapidly advancing field of AI, we are witnessing exponential advancements that are reshaping technology at a remarkable pace. The emergence of ChatGPT marked a pivotal moment in AI, sparking a global conversation about language models capable of performing tasks with exceptional proficiency, far beyond what was previously imagined !!

Artificial Intelligence is not a substitute for human intelligence; it is a tool to amplify human creativity and ingenuity. — Fei Fei Li

This article explores types of use cases that you can build using LLMs and evaluation strategies for understanding LLMs’ performance.

A quick recap for someone wondering what an LLM is

LLM is an advanced version of a language model with billions of parameters, which would be trained on vast amounts of text data. LLMs excel at both generative tasks, such as writing coherent paragraphs, and predictive tasks, like filling in missing words in sentences or classifying texts.

What can we build using LLMs?

Applications that integrate LLMs are often referred to as LLM-powered applications, enabling capabilities such as chatbots like ChatGPT, automated content generation, and agentic solutions. When we build such an application, it is very critical to ensure it is stable and tested thoroughly.

LLM outputs are non-deterministic, meaning the model can generate different responses to the same input. Given this behaviour, evaluation criteria should focus on understanding both its capabilities and potential risks. For example, if we are building a Question & Answering System, it is essential to check whether the chatbot is giving correct answers. Some other evaluation criteria could include — Are the answers helpful? Is the tone okay? Is the model hallucinating? Is the chatbot being fair? etc.

How do we evaluate our LLM apps?

There are several ways to evaluate LLMs, including manual evaluation and automated methods. One approach is to assemble a set of domain experts to create a gold-standard dataset and compare the model’s output against the experts’ expectations. However, this process can be costly and may not be feasible when building solutions for certain specialised domains, such as healthcare.

Let’s explore different approaches for gathering evaluation datasets and the strategies you can use to assess LLM performance.

Ways to generate Evaluation datasets:

Using Publicly Available Benchmark Datasets — Leverage established benchmark datasets designed for evaluating LLMs.
Utilising Existing Business Data — Usage of internal datasets relevant to your specific domain.
Generating Synthetic Datasets — Create artificial data using existing source documents.
Creating Test Cases — Design targeted test cases to assess model responses based on predefined evaluation criteria.

By combining all the evaluation datasets, we can group the final evaluation strategy into three main categories:

Happy Path
Covers expected use cases where the application functions as intended, ensuring the model performs well in typical scenarios.
Edge Cases
Includes unexpected user interactions that fall outside the project scope. For example, how does the model handle off-topic questions or sensitive queries?
Adversarial Scenarios
Focuses on attempts by malicious users to exploit the application, such as extracting harmful content or manipulating the LLM’s responses.

LLM Evaluation Methods and Metrics

Group 1: Requires Ground Truth Dataset — Measures Correctness

These evaluation methods are focused on assessing the correctness of the model’s outputs, typically by comparing them to a predefined “ground truth” or expected outcome.

Predictive Quality Metrics
Predictive quality metrics are primarily used for tasks like classification, where the goal is to predict labels based on input data. Common metrics in this category include:

Accuracy: The percentage of correct predictions out of all predictions made.
Precision, Recall, and F1-score: Precision measures the proportion of correct positive predictions out of all positive predictions, while recall calculates the proportion of correct positive predictions out of all actual positives. The F1-score is the harmonic mean of precision and recall, providing a balance between the two.
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): This measures the model’s ability to distinguish between classes across different thresholds.

2. Generative Quality Metrics
In generative tasks, such as text generation or machine translation, the model is expected to produce fluent and accurate outputs. Common metrics include:

BLEU (Bilingual Evaluation Understudy): Measures the overlap of n-grams between the generated text and the reference text. It’s widely used in machine translation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures the overlap of n-grams, words, or word sequences between the model-generated output and a set of reference outputs. It’s often used in tasks like summarisation.

3. Semantic Similarity
These methods assess how semantically similar the model-generated output is to the reference output at the embedding level. The most common approach is to measure the cosine similarity between the embeddings of the generated and reference texts. A higher cosine similarity indicates that the two pieces of text are more similar in meaning.

Cosine Similarity: Measures the cosine of the angle between two vectors (representing text in an embedding space). A cosine similarity of 1 indicates identical vectors, while 0 indicates orthogonal and -1 indicates opposite (completely dissimilar) vectors.

4. LLM-as-a-Judge
This method involves the LLM acting as the judge or evaluator of the potential LLM’s responses, even without a strict ground truth.

For example, the model could assess its output based on predefined guidelines or criteria, such as coherence, relevance, and alignment with the expected output.

Sometimes a bigger and better LLM is used as a judge to evaluate the candidate LLM’s outputs.

Group 2: Does Not Require Ground Truth — Measures Relevance

These methods evaluate how relevant, useful, or meaningful the model’s output is, without requiring an exact “correct” answer to compare against.

Sentiment Analysis
Sentiment analysis evaluates the emotional tone or sentiment behind a piece of text. LLMs can be evaluated by how accurately they identify sentiments such as positive, negative, or neutral within generated content. While this doesn’t require ground truth in the traditional sense, it relies on established sentiment categories to determine relevance and accuracy.
Regular Expressions
Regular expressions can be used to evaluate specific patterns or structures within the generated text, such as verifying the presence of phone numbers, email addresses, or dates in a response. This type of evaluation does not depend on ground truth, but rather on whether the output meets certain predefined patterns or formats.
LLM-as-a-Judge (Relevance Evaluation)
Here, the LLM evaluates its response for relevance, usefulness, and alignment with the task. For instance, the model could assess whether its answer is on-topic and provides valuable information. This approach doesn’t require a ground truth but relies on predefined relevance criteria.
Functional Testing
Functional testing ensures that LLM-generated outputs are executable and usable in real-world contexts. Examples include:

Executing Code: Running generated code (e.g., Python) to verify it works as intended.
SQL Query Execution: Testing generated SQL queries by executing them against a database to ensure they return the expected results.
APIs and Workflows: Verifying that generated API calls or workflows function as intended in a practical application.

The focus is on the relevance and operability of the model’s outputs.

In conclusion, evaluating LLM applications requires careful examination by combining methods that assess both the correctness and relevance of LLM’s outputs. By using techniques such as ground-truth-based metrics, functional testing, and self-assessment by another LLM, we can gain a comprehensive understanding of the model’s performance. Thorough evaluation is crucial to ensure that LLMs not only generate accurate outputs but also remain useful, relevant, and reliable in real-world applications.

References:

LLM evaluation course by Evidently AI https://www.evidentlyai.com/llm-evaluations-course
What are LLMs? by IBM https://www.ibm.com/think/topics/large-language-models