Human-machine interaction
source

In the rapidly advancing field of AI, we are witnessing exponential advancements that are reshaping technology at a remarkable pace. The emergence of ChatGPT marked a pivotal moment in AI, sparking a global conversation about language models capable of performing tasks with exceptional proficiency, far beyond what was previously imagined !!

Artificial Intelligence is not a substitute for human intelligence; it is a tool to amplify human creativity and ingenuity. — Fei Fei Li

This article explores types of use cases that you can build using LLMs and evaluation strategies for understanding LLMs’ performance.

A quick recap for someone wondering what an LLM is

LLM is an advanced version of a language model with billions of parameters, which would be trained on vast amounts of text data. LLMs excel at both generative tasks, such as writing coherent paragraphs, and predictive tasks, like filling in missing words in sentences or classifying texts.

What can we build using LLMs?

Applications that integrate LLMs are often referred to as LLM-powered applications, enabling capabilities such as chatbots like ChatGPT, automated content generation, and agentic solutions. When we build such an application, it is very critical to ensure it is stable and tested thoroughly.

LLM outputs are non-deterministic, meaning the model can generate different responses to the same input. Given this behaviour, evaluation criteria should focus on understanding both its capabilities and potential risks. For example, if we are building a Question & Answering System, it is essential to check whether the chatbot is giving correct answers. Some other evaluation criteria could include — Are the answers helpful? Is the tone okay? Is the model hallucinating? Is the chatbot being fair? etc.

How do we evaluate our LLM apps?

There are several ways to evaluate LLMs, including manual evaluation and automated methods. One approach is to assemble a set of domain experts to create a gold-standard dataset and compare the model’s output against the experts’ expectations. However, this process can be costly and may not be feasible when building solutions for certain specialised domains, such as healthcare.

Let’s explore different approaches for gathering evaluation datasets and the strategies you can use to assess LLM performance.

Ways to generate Evaluation datasets:

By combining all the evaluation datasets, we can group the final evaluation strategy into three main categories:

  1. Happy Path
    Covers expected use cases where the application functions as intended, ensuring the model performs well in typical scenarios.
  2. Edge Cases
    Includes unexpected user interactions that fall outside the project scope. For example, how does the model handle off-topic questions or sensitive queries?
  3. Adversarial Scenarios
    Focuses on attempts by malicious users to exploit the application, such as extracting harmful content or manipulating the LLM’s responses.

LLM Evaluation Methods and Metrics

Group 1: Requires Ground Truth Dataset — Measures Correctness

These evaluation methods are focused on assessing the correctness of the model’s outputs, typically by comparing them to a predefined “ground truth” or expected outcome.

  1. Predictive Quality Metrics
    Predictive quality metrics are primarily used for tasks like classification, where the goal is to predict labels based on input data. Common metrics in this category include:

2. Generative Quality Metrics
In generative tasks, such as text generation or machine translation, the model is expected to produce fluent and accurate outputs. Common metrics include:

3. Semantic Similarity
These methods assess how semantically similar the model-generated output is to the reference output at the embedding level. The most common approach is to measure the cosine similarity between the embeddings of the generated and reference texts. A higher cosine similarity indicates that the two pieces of text are more similar in meaning.

Source

4. LLM-as-a-Judge
This method involves the LLM acting as the judge or evaluator of the potential LLM’s responses, even without a strict ground truth.

For example, the model could assess its output based on predefined guidelines or criteria, such as coherence, relevance, and alignment with the expected output.

Source
Sometimes a bigger and better LLM is used as a judge to evaluate the candidate LLM’s outputs.

Group 2: Does Not Require Ground Truth — Measures Relevance

These methods evaluate how relevant, useful, or meaningful the model’s output is, without requiring an exact “correct” answer to compare against.

  1. Sentiment Analysis
    Sentiment analysis evaluates the emotional tone or sentiment behind a piece of text. LLMs can be evaluated by how accurately they identify sentiments such as positive, negative, or neutral within generated content. While this doesn’t require ground truth in the traditional sense, it relies on established sentiment categories to determine relevance and accuracy.
  2. Regular Expressions
    Regular expressions can be used to evaluate specific patterns or structures within the generated text, such as verifying the presence of phone numbers, email addresses, or dates in a response. This type of evaluation does not depend on ground truth, but rather on whether the output meets certain predefined patterns or formats.
  3. LLM-as-a-Judge (Relevance Evaluation)
    Here, the LLM evaluates its response for relevance, usefulness, and alignment with the task. For instance, the model could assess whether its answer is on-topic and provides valuable information. This approach doesn’t require a ground truth but relies on predefined relevance criteria.
  4. Functional Testing
    Functional testing ensures that LLM-generated outputs are executable and usable in real-world contexts. Examples include:

The focus is on the relevance and operability of the model’s outputs.

In conclusion, evaluating LLM applications requires careful examination by combining methods that assess both the correctness and relevance of LLM’s outputs. By using techniques such as ground-truth-based metrics, functional testing, and self-assessment by another LLM, we can gain a comprehensive understanding of the model’s performance. Thorough evaluation is crucial to ensure that LLMs not only generate accurate outputs but also remain useful, relevant, and reliable in real-world applications.

References:

  1. LLM evaluation course by Evidently AI https://www.evidentlyai.com/llm-evaluations-course
  2. What are LLMs? by IBM https://www.ibm.com/think/topics/large-language-models