langcheck.metrics

langcheck.metrics contains all of LangCheck’s evaluation metrics.

Since LangCheck has multi-lingual support, language-specific metrics are organized into sub-packages such as langcheck.metrics.en or langcheck.metrics.ja.

Tip

As a shortcut, all English and language-agnostic metrics are also directly accessible from langcheck.metrics. For example, you can directly run langcheck.metrics.sentiment() instead of langcheck.metrics.en.reference_free_text_quality.sentiment().

Additionally, langcheck.metrics.MetricValue is a shortcut for langcheck.metrics.metric_value.MetricValue.

There are several different types of metrics:

| Type of Metric | Examples | Languages |
| --- | --- | --- |
| Reference-Free Text Quality Metrics | toxicity(generated_outputs), sentiment(generated_outputs), ai_disclaimer_similarity(generated_outputs) | EN, JA |
| Reference-Based Text Quality Metrics | semantic_similarity(generated_outputs, reference_outputs), rouge2(generated_outputs, reference_outputs) | EN, JA |
| Source-Based Text Quality Metrics | factual_consistency(generated_outputs, sources) | EN, JA |
| Text Structure Metrics | is_float(generated_outputs, min=0, max=None), is_json_object(generated_outputs) | All Languages |


class langcheck.metrics.MetricValue(metric_name: str, metric_values: List[NumericType], prompts: List[str] | None, generated_outputs: List[str], reference_outputs: List[str] | None, sources: List[str] | None, language: str | None)

Bases: Generic[NumericType]

A rich object that is the output of any langcheck.metrics function.

all() → bool

Equivalent to all(metric_value.metric_values). This is mostly useful for binary metric functions.

any() → bool

Equivalent to any(metric_value.metric_values). This is mostly useful for binary metric functions.

generated_outputs: List[str]

histogram(jupyter_mode: str = 'inline')

Shows an interactive histogram of all data points in MetricValue. Intended to be used in a Jupyter notebook.

This is a convenience function that calls langcheck.plot.histogram().

language: str | None
metric_name: str
metric_values: List[NumericType]
prompts: List[str] | None
reference_outputs: List[str] | None

scatter(jupyter_mode: str = 'inline')

Shows an interactive scatter plot of all data points in MetricValue. Intended to be used in a Jupyter notebook.

This is a convenience function that calls langcheck.plot.scatter().

sources: List[str] | None

to_df() → DataFrame

Returns a DataFrame of metric values for each data point.

langcheck.metrics.ai_disclaimer_similarity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, ai_disclaimer_phrase: str = "I don't have personal opinions, emotions, or consciousness.", embedding_model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the degree to which the LLM’s output contains a disclaimer that it is an AI. This is calculated by computing the semantic similarity between the generated outputs and a reference AI disclaimer phrase; by default, this phrase is “I don’t have personal opinions, emotions, or consciousness.”, but you can also pass in a custom phrase. Please refer to semantic_similarity() for details on the typical output ranges and the supported embedding model types.

Parameters:
  • generated_outputs – A list of model generated outputs to evaluate

  • prompts – An optional list of prompts used to generate the outputs. Prompts are not evaluated and only used as metadata.

  • ai_disclaimer_phrase – Reference AI disclaimer phrase, default “I don’t have personal opinions, emotions, or consciousness.”

  • embedding_model_type – The type of embedding model to use (‘local’ or ‘openai’), default ‘local’

  • openai_args – Dict of additional args to pass in to the openai.Embedding.create function, default None

Returns:

A MetricValue object

langcheck.metrics.contains_all_strings(generated_outputs: List[str] | str, strings: List[str], case_sensitive: bool = False, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs contain all strings in a given list. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • strings – A list of strings to match

  • case_sensitive – Whether to match case sensitively or not, default False

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.contains_any_strings(generated_outputs: List[str] | str, strings: List[str], case_sensitive: bool = False, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs contain any strings in a given list. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • strings – A list of strings to match

  • case_sensitive – Whether to match case sensitively or not, default False

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
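As a rough illustration of the "all" versus "any" string checks described above, here is a minimal standard-library sketch. The helper name `contains_strings` and its `require_all` flag are hypothetical and not part of LangCheck; the lowercasing strategy for case-insensitive matching is an assumption about how such a check could work, not a description of the library's implementation.

```python
from typing import List


def contains_strings(output: str, strings: List[str], require_all: bool,
                     case_sensitive: bool = False) -> int:
    """Return 1 if `output` contains all (or any) of `strings`, else 0.

    Hypothetical helper for illustration only.
    """
    if not case_sensitive:
        # Lowercase both sides so that "Tokyo" matches "tokyo".
        output = output.lower()
        strings = [s.lower() for s in strings]
    hits = [s in output for s in strings]
    return int(all(hits) if require_all else any(hits))


outputs = ["The capital of Japan is Tokyo.", "I am not sure."]
any_scores = [contains_strings(o, ["tokyo", "kyoto"], require_all=False)
              for o in outputs]
all_scores = [contains_strings(o, ["tokyo", "japan"], require_all=True)
              for o in outputs]
print(any_scores, all_scores)
```

Each output yields one binary value, which mirrors the 0 or 1 values that these metrics are documented to take.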

langcheck.metrics.contains_regex(generated_outputs: List[str] | str, regex: str, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if any part of each generated output matches a given regular expression. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • regex – The regular expression to match

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.exact_match(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if the generated outputs exactly match the reference outputs. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.factual_consistency(generated_outputs: List[str] | str, sources: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the factual consistency between the generated outputs and the sources. The factual consistency score for one generated output is computed as the average of the per-sentence consistencies of the generated output with the source text. This metric takes on float values in the range [0, 1], where 0 means that the output is not at all consistent with the source text, and 1 means that the output is fully consistent with the source text. (NOTE: when using the OpenAI model, the factuality score for each sentence is either 0.0, 0.5, or 1.0.)

We currently support two model types:

1. The ‘local’ type, where the ‘unieval-fact’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • sources – The source text(s), one string per generated output

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’

  • openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None

Returns:

A MetricValue object
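The per-sentence averaging described above can be sketched as follows. This is only an illustration of the aggregation step: the sentence splitter is a naive regex, `toy_scorer` is a hypothetical word-overlap stand-in for the real consistency model, and returning 0.0 for an empty output is an assumption.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def average_consistency(generated_output: str, source: str,
                        score_sentence: Callable[[str, str], float]) -> float:
    """Average the per-sentence scores of one generated output."""
    sentences = split_sentences(generated_output)
    if not sentences:
        return 0.0  # assumption: an empty output scores 0
    return sum(score_sentence(s, source) for s in sentences) / len(sentences)


def toy_scorer(sentence: str, source: str) -> float:
    # Hypothetical stand-in for a model: 1.0 only if every word of the
    # sentence also appears in the source.
    source_words = set(re.findall(r"\w+", source.lower()))
    words = re.findall(r"\w+", sentence.lower())
    return 1.0 if all(w in source_words for w in words) else 0.0


source = "Tokyo is the capital of Japan."
output = "Tokyo is the capital of Japan. Tokyo is the capital of France."
score = average_consistency(output, source, toy_scorer)
print(score)
```

One supported sentence and one unsupported sentence average to a score halfway between the two extremes.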

langcheck.metrics.flesch_kincaid_grade(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the readability of generated outputs using the Flesch-Kincaid Grade Level metric. This metric takes on float values between [-3.40, ∞), but typically ranges between 0 and 12 (corresponding to U.S. grade levels), where lower scores mean the text is easier to read.

Like the Flesch Reading Ease Score, this metric is based on the number of sentences, words, and syllables in the text.

Ref:

https://apps.dtic.mil/sti/citations/ADA006655

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
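For reference, the standard Flesch-Kincaid Grade Level formula can be written directly in terms of sentence, word, and syllable counts. The sketch below takes the counts as inputs (how LangCheck counts syllables is not covered here), and the function name is hypothetical. It also shows where the documented lower bound of about -3.40 comes from: a text of one-word, one-syllable sentences.

```python
def flesch_kincaid_grade_from_counts(n_sentences: int, n_words: int,
                                     n_syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Lowest possible grade: every sentence is a single one-syllable word.
lowest = flesch_kincaid_grade_from_counts(n_sentences=1, n_words=1,
                                          n_syllables=1)

# A more typical text: 5 sentences, 100 words, 150 syllables.
typical = flesch_kincaid_grade_from_counts(n_sentences=5, n_words=100,
                                           n_syllables=150)
print(round(lowest, 2), round(typical, 2))
```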

langcheck.metrics.flesch_reading_ease(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the readability of generated outputs using the Flesch Reading Ease Score. This metric takes on float values between (-∞, 121.22], but typically ranges between 0 and 100, where higher scores mean the text is easier to read.

The score is based on the number of sentences, words, and syllables in the text. See “How to Write Plain English” by Rudolf Franz Flesch for more details.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
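The standard Flesch Reading Ease formula, expressed in terms of sentence, word, and syllable counts, is sketched below. The function name is hypothetical and the counts are passed in directly rather than derived from text. The documented upper bound of 121.22 corresponds to one-word, one-syllable sentences.

```python
def flesch_reading_ease_from_counts(n_sentences: int, n_words: int,
                                    n_syllables: int) -> float:
    """Standard Flesch Reading Ease formula from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Highest possible score: every sentence is a single one-syllable word.
highest = flesch_reading_ease_from_counts(n_sentences=1, n_words=1,
                                          n_syllables=1)

# A more typical text: 5 sentences, 100 words, 150 syllables.
typical = flesch_reading_ease_from_counts(n_sentences=5, n_words=100,
                                          n_syllables=150)
print(round(highest, 2), round(typical, 3))
```

Longer sentences and longer words both lower the score, which is why the metric has no lower bound.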

langcheck.metrics.fluency(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the fluency scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low fluency and 1 is high fluency.

We currently support two model types:

1. The ‘local’ type, where the Parrot fluency model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’

  • openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None

Returns:

A MetricValue object

langcheck.metrics.is_float(generated_outputs: List[str] | str, min: float | None = None, max: float | None = None, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs can be parsed as floating point numbers, optionally within a min/max range. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • min – The optional minimum valid float

  • max – The optional maximum valid float

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
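A float check with an optional range could look roughly like the sketch below. The helper name and the parameter names `min_value` and `max_value` are hypothetical, and treating both bounds as inclusive is an assumption rather than documented behavior.

```python
from typing import Optional


def is_float_in_range(output: str, min_value: Optional[float] = None,
                      max_value: Optional[float] = None) -> int:
    """Return 1 if `output` parses as a float within the optional bounds."""
    try:
        value = float(output)
    except ValueError:
        return 0  # not parseable as a float
    if min_value is not None and value < min_value:
        return 0
    if max_value is not None and value > max_value:
        return 0
    return 1


outputs = ["0.5", "-1", "abc", "1e3"]
scores = [is_float_in_range(o, min_value=0) for o in outputs]
print(scores)
```

Note that Python's `float()` also accepts strings such as "nan" and "inf", so a stricter check may be needed depending on the use case.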

langcheck.metrics.is_int(generated_outputs: List[str] | str, domain: Iterable[int] | Container[int] | None = None, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs can be parsed as integers, optionally within a domain of integers like range(1, 11) or {1, 3, 5}. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • domain – The optional domain of valid integers

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
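The domain check described above works with any object that supports the `in` operator, such as a `range` or a `set`. The following is a minimal sketch with a hypothetical helper name; it assumes that strings like "3.0" do not count as integers, which is how Python's `int()` treats them.

```python
from typing import Container, Optional


def is_int_in_domain(output: str,
                     domain: Optional[Container[int]] = None) -> int:
    """Return 1 if `output` parses as an int that lies in `domain`."""
    try:
        value = int(output)
    except ValueError:
        return 0  # e.g. "3.0" or "seven"
    if domain is not None and value not in domain:
        return 0
    return 1


outputs = ["7", "11", "3.0", "seven"]
range_scores = [is_int_in_domain(o, range(1, 11)) for o in outputs]
set_scores = [is_int_in_domain(o, {1, 3, 5}) for o in ["3", "4"]]
print(range_scores, set_scores)
```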

langcheck.metrics.is_json_array(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs can be parsed as JSON arrays. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.is_json_object(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs can be parsed as JSON objects. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
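The difference between a JSON object and a JSON array check can be sketched with the standard `json` module: a parsed JSON object becomes a Python `dict`, and a parsed JSON array becomes a Python `list`. The helper names below are hypothetical and only illustrate the idea.

```python
import json
from typing import Optional


def parsed_json_type(output: str) -> Optional[type]:
    """Return the Python type of the parsed JSON value, or None if invalid."""
    try:
        return type(json.loads(output))
    except json.JSONDecodeError:
        return None


def is_json_object_sketch(output: str) -> int:
    return int(parsed_json_type(output) is dict)


def is_json_array_sketch(output: str) -> int:
    return int(parsed_json_type(output) is list)


outputs = ['{"a": 1}', "[1, 2]", "42", "not json"]
object_scores = [is_json_object_sketch(o) for o in outputs]
array_scores = [is_json_array_sketch(o) for o in outputs]
print(object_scores, array_scores)
```

A bare scalar such as "42" is valid JSON but is neither an object nor an array, so both checks return 0 for it.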

langcheck.metrics.matches_regex(generated_outputs: List[str] | str, regex: str, prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs fully match a given regular expression. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • regex – The regular expression to match

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
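The distinction between a full match (matches_regex) and a partial match (contains_regex) corresponds to the difference between `re.fullmatch` and `re.search` in Python's standard library. The sketch below uses hypothetical helper names and assumes this correspondence; it is not taken from the library's source.

```python
import re


def contains_regex_sketch(output: str, regex: str) -> int:
    # Partial match: the pattern may occur anywhere in the output.
    return int(re.search(regex, output) is not None)


def matches_regex_sketch(output: str, regex: str) -> int:
    # Full match: the entire output must match the pattern.
    return int(re.fullmatch(regex, output) is not None)


pattern = r"\d{3}-\d{4}"
outputs = ["555-1234", "Call 555-1234 now"]
contains_scores = [contains_regex_sketch(o, pattern) for o in outputs]
matches_scores = [matches_regex_sketch(o, pattern) for o in outputs]
print(contains_scores, matches_scores)
```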

langcheck.metrics.rouge1(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the ROUGE-1 F1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values in the range [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
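To make the unigram-overlap F1 concrete, here is a simplified sketch. It assumes lowercased whitespace tokenization, which is a simplification; the referenced ROUGE implementation may tokenize and stem differently, so exact scores can differ. The function name is hypothetical.

```python
from collections import Counter


def rouge1_f1(generated: str, reference: str) -> float:
    """F1 of clipped unigram overlap between two strings."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

In this example five of the six tokens on each side overlap, so precision and recall are both 5/6 and so is the F1 score.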

langcheck.metrics.rouge2(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the ROUGE-2 F1 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values in the range [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.rougeL(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the ROUGE-L F1 scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values in the range [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
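The LCS-based F1 can be sketched with a small dynamic-programming routine. As a simplifying assumption, tokens are produced by lowercased whitespace splitting, so scores may differ from the referenced ROUGE implementation. The function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated: str, reference: str) -> float:
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(gen_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


same_order = rouge_l_f1("the cat", "the cat")
swapped = rouge_l_f1("cat the", "the cat")
print(same_order, swapped)
```

Unlike a plain token-overlap measure, the LCS respects word order, so swapping the two words halves the score even though the same tokens are present.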

langcheck.metrics.semantic_similarity(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, embedding_model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity. (NOTE: when using OpenAI embeddings, the cosine similarities tend to be skewed quite heavily towards higher numbers.)

We currently support two embedding model types:

1. The ‘local’ type, where the ‘all-mpnet-base-v2’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘text-embedding-ada-002’ model by default (this is configurable). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.

Ref:

https://huggingface.co/tasks/sentence-similarity
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
https://openai.com/blog/new-and-improved-embedding-model

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • embedding_model_type – The type of embedding model to use (‘local’ or ‘openai’), default ‘local’

  • openai_args – Dict of additional args to pass in to the openai.Embedding.create function, default None

Returns:

A MetricValue object
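The cosine similarity underlying this metric can be illustrated with plain Python lists standing in for embedding vectors. The toy two-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions. The sketch assumes both vectors are non-zero.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The three cases show the full theoretical range of [-1, 1]; embeddings of natural-language text rarely point in opposite directions, which is consistent with scores typically falling between 0 and 1.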

langcheck.metrics.sentiment(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using the OpenAI model, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive).)

We currently support two model types:

1. The ‘local’ type, where the Twitter-roBERTa-base model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’

  • openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None

Returns:

A MetricValue object

langcheck.metrics.toxicity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity.

We currently support two model types:

1. The ‘local’ type, where the Detoxify model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’

  • openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None

Returns:

A MetricValue object

langcheck.metrics.validation_fn(generated_outputs: List[str] | str, valid_fn: Callable[[str], bool], prompts: List[str] | str | None = None) → MetricValue[int]

Checks if generated outputs are valid according to an arbitrary function. This metric takes on binary 0 or 1 values.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • valid_fn – A function that takes a single string and returns a bool determining whether the string is valid or not. The function can also raise an exception on failure.

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
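A custom validation function of the kind described above might be applied as in the sketch below. The wrapper `validate_outputs` and the example validator `is_even_integer` are hypothetical; treating a raised exception as an invalid output (score 0) is an assumption based on the note that the function may raise on failure.

```python
from typing import Callable, List


def validate_outputs(generated_outputs: List[str],
                     valid_fn: Callable[[str], bool]) -> List[int]:
    """Apply `valid_fn` to each output; exceptions count as invalid."""
    scores = []
    for output in generated_outputs:
        try:
            scores.append(int(bool(valid_fn(output))))
        except Exception:
            # Assumption: a raised exception means the output is invalid.
            scores.append(0)
    return scores


def is_even_integer(text: str) -> bool:
    # Raises ValueError when `text` is not an integer literal.
    return int(text) % 2 == 0


scores = validate_outputs(["4", "7", "four"], is_even_integer)
print(scores)
```

Letting the validator raise keeps it short: it does not need its own error handling for malformed inputs.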