Metrics#

This page describes LangCheck’s metrics for evaluating LLMs.

Importing Metrics#

Inside the LangCheck package, metrics are first categorized by language. For example, langcheck.metrics.en contains all metrics for English text.

Tip

For English text, you can also directly import metrics from langcheck.metrics, which contains all English metrics and language-agnostic metrics:

# Short version
from langcheck.metrics import fluency
from langcheck.metrics import is_json_array

# Long version
from langcheck.metrics.en.reference_free_text_quality import fluency
from langcheck.metrics.text_structure import is_json_array

Within each language, metrics are further categorized by metric type. For example, langcheck.metrics.ja.reference_free_text_quality contains all reference-free text quality metrics for Japanese. You can also import these metrics directly from langcheck.metrics.ja.

So, for Japanese text, you can import Japanese text metrics from langcheck.metrics.ja, and language-agnostic metrics from langcheck.metrics.

from langcheck.metrics.ja import fluency  # Japanese fluency metric
from langcheck.metrics import is_json_array  # Language-agnostic metric

Metric Types#

LangCheck metrics are categorized by metric type, which corresponds to the kind of ground truth data that’s required.

Type of Metric	Examples	Languages
Reference-Free Text Quality Metrics	`toxicity(generated_outputs)` `sentiment(generated_outputs)` `ai_disclaimer_similarity(generated_outputs)`	EN, JA
Reference-Based Text Quality Metrics	`semantic_similarity(generated_outputs, reference_outputs)` `rouge2(generated_outputs, reference_outputs)`	EN, JA
Source-Based Text Quality Metrics	`factual_consistency(generated_outputs, sources)`	EN, JA
Text Structure Metrics	`is_float(generated_outputs, min=0, max=None)` `is_json_object(generated_outputs)`	All Languages

Reference-Free Text Quality Metrics#

Reference-free metrics require no ground truth. They evaluate the quality of the generated text on its own.

An example metric is toxicity(), which directly evaluates the level of toxicity in some text as a score between 0 and 1.

Reference-Based Text Quality Metrics#

Reference-based metrics require a ground truth output (a “reference”) to compare LLM outputs against. For example, in a Q&A application, you might have human-written answers as references.

An example metric is semantic_similarity(), which computes the semantic similarity between the LLM-generated text and the reference text as a score between -1 and 1.
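
To see why a similarity score can range from -1 to 1, here is a minimal sketch that assumes the score is a cosine similarity between embedding vectors. That is an assumption made for illustration, not a statement about LangCheck's implementation, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings", invented for illustration only
generated_vec = [1.0, 2.0, 0.5]
same_direction = [2.0, 4.0, 1.0]        # same meaning, different magnitude
orthogonal = [2.0, -1.0, 0.0]           # unrelated
opposite_direction = [-1.0, -2.0, -0.5]  # opposite

for name, vec in [("same", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite_direction)]:
    print(name, round(cosine_similarity(generated_vec, vec), 3))
```

Vectors pointing the same way score close to 1, unrelated vectors score close to 0, and vectors pointing in opposite directions score close to -1.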

Source-Based Text Quality Metrics#

Source-based metrics require a “source” text. Unlike references, which are expected outputs, sources are inputs to the LLM. For example, in a Q&A application, the source might be relevant documents that are concatenated with the question and passed into the LLM’s context window (this is called Retrieval Augmented Generation, or RAG).

An example metric is factual_consistency(), which measures the factual consistency between the LLM’s generated text and the source text as a score between 0 and 1.
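
The difference between a source and a reference can be sketched with a small data structure. The `QAExample` class and `build_prompt` helper below are hypothetical names used only for illustration and are not part of LangCheck. The point is that the source goes into the prompt, while the reference is held back for comparison with the LLM's answer.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    question: str
    source: str            # retrieved documents: an INPUT to the LLM
    reference_output: str  # human-written answer: an expected OUTPUT


def build_prompt(example):
    """Concatenate the source documents with the question (RAG-style)."""
    return f"Context:\n{example.source}\n\nQuestion: {example.question}"


example = QAExample(
    question="Where does the cat sit?",
    source="The cat sat on the mat.",
    reference_output="On the mat.",
)

# The source is part of what the LLM sees; the reference never is
prompt = build_prompt(example)
print(prompt)
```

A source-based metric would compare the LLM's answer with `example.source`, whereas a reference-based metric would compare it with `example.reference_output`.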

Text Structure Metrics#

Text structure metrics validate the format of the text (e.g. is the text valid JSON, an email address, or an integer in a specified range?). Unlike other metric types, which can return floats, these metrics return only 0 or 1.

An example metric is is_json_object(), which checks if the LLM-generated text is a valid JSON object.
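
The kind of check these metrics perform can be sketched with the standard library. The helpers `sketch_is_json_object` and `sketch_is_float`, along with the `min_value` and `max_value` parameters, are hypothetical stand-ins for illustration and may differ from how LangCheck implements its own metrics.

```python
import json


def sketch_is_json_object(text):
    """Return 1 if `text` parses as a JSON object, otherwise 0."""
    try:
        parsed = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return 0
    # Valid JSON arrays, numbers, and strings are not objects
    return 1 if isinstance(parsed, dict) else 0


def sketch_is_float(text, min_value=None, max_value=None):
    """Return 1 if `text` parses as a float within the optional bounds, else 0."""
    try:
        value = float(text)
    except (ValueError, TypeError):
        return 0
    if min_value is not None and value < min_value:
        return 0
    if max_value is not None and value > max_value:
        return 0
    return 1


outputs = ['{"name": "LangCheck"}', '[1, 2, 3]', 'not json']
json_scores = [sketch_is_json_object(o) for o in outputs]
print(json_scores)  # [1, 0, 0]
```

The second output is valid JSON but is an array rather than an object, so it scores 0 in this sketch.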

Computing Metrics with OpenAI Models#

Several text quality metrics are computed using a model (e.g. toxicity, sentiment, semantic_similarity, factual_consistency). By default, LangCheck will download and use a model that can run locally on your machine (often from HuggingFace) so that the metric function works with no additional setup.

However, if you have an OpenAI API key, you can also configure these metrics to use an OpenAI model, which may provide more accurate results for more complex use cases. Here are some examples of how to do this:

import os
from langcheck.metrics.en import semantic_similarity

generated_outputs = ["The cat is sitting on the mat."]
reference_outputs = ["The cat sat on the mat."]

# Option 1: Set OPENAI_API_KEY as an environment variable
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'
similarity_value = semantic_similarity(generated_outputs,
                                       reference_outputs,
                                       model_type='openai')

# Option 2: Pass in an OpenAI client directly
from openai import OpenAI

client = OpenAI(api_key='YOUR_OPENAI_API_KEY')
similarity_value = semantic_similarity(generated_outputs,
                                       reference_outputs,
                                       model_type='openai',
                                       openai_client=client)

If you’re using Azure OpenAI instead, here are the equivalent examples:

import os
from langcheck.metrics.en import semantic_similarity

generated_outputs = ["The cat is sitting on the mat."]
reference_outputs = ["The cat sat on the mat."]

# Option 1: Set the AZURE_OPENAI_KEY, OPENAI_API_VERSION, and
# AZURE_OPENAI_ENDPOINT environment variables
os.environ["AZURE_OPENAI_KEY"] = 'YOUR_AZURE_OPENAI_KEY'
os.environ["OPENAI_API_VERSION"] = 'YOUR_OPENAI_API_VERSION'
os.environ["AZURE_OPENAI_ENDPOINT"] = 'YOUR_AZURE_OPENAI_ENDPOINT'

# When using the Azure API type, you need to pass in your model's
# deployment name
similarity_value = semantic_similarity(
    generated_outputs,
    reference_outputs,
    model_type='azure_openai',
    openai_args={'model': 'YOUR_EMBEDDING_MODEL_DEPLOYMENT_NAME'})

# Option 2: Pass in an AzureOpenAI client directly
from openai import AzureOpenAI

client = AzureOpenAI(api_key='YOUR_AZURE_OPENAI_KEY',
                     api_version='YOUR_OPENAI_API_VERSION',
                     azure_endpoint='YOUR_AZURE_OPENAI_ENDPOINT')
similarity_value = semantic_similarity(
    generated_outputs,
    reference_outputs,
    model_type='azure_openai',
    openai_client=client,
    openai_args={'model': 'YOUR_EMBEDDING_MODEL_DEPLOYMENT_NAME'})
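
When relying on environment variables (Option 1 in both examples above), it can help to confirm they are set before calling a metric. The sketch below is a hypothetical convenience helper, not part of LangCheck. It uses only the variable names shown above.

```python
import os

# Variable names taken from the examples above
REQUIRED_OPENAI_VARS = ("OPENAI_API_KEY",)
REQUIRED_AZURE_VARS = ("AZURE_OPENAI_KEY",
                       "OPENAI_API_VERSION",
                       "AZURE_OPENAI_ENDPOINT")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in `environ`.

    `environ` defaults to the real process environment; passing a dict
    makes the helper easy to try out without touching os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_AZURE_VARS)
if missing:
    print("Set these variables before using model_type='azure_openai':",
          ", ".join(missing))
```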