langcheck.metrics

langcheck.metrics contains all of LangCheck’s evaluation metrics.

Since LangCheck has multi-lingual support, language-specific metrics are organized into sub-packages such as langcheck.metrics.en or langcheck.metrics.ja.

Tip

As a shortcut, all English and language-agnostic metrics are also directly accessible from langcheck.metrics. For example, you can directly run langcheck.metrics.sentiment() instead of langcheck.metrics.en.reference_free_text_quality.sentiment().

Additionally, langcheck.metrics.MetricValue is a shortcut for langcheck.metrics.metric_value.MetricValue.
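The shortcut described in the tip can be sketched as follows. This is a minimal illustration, assuming LangCheck is installed; the helper name `sentiment_via_shortcut` is hypothetical and not part of the library.

```python
def sentiment_via_shortcut(generated_outputs):
    """Score sentiment through the top-level langcheck.metrics shortcut.

    Hypothetical helper for illustration; only langcheck.metrics.sentiment()
    comes from the documentation above.
    """
    # Imported lazily so this sketch can be loaded without LangCheck installed.
    import langcheck.metrics

    # Per the tip, this is the short form of
    # langcheck.metrics.en.reference_free_text_quality.sentiment(...)
    return langcheck.metrics.sentiment(generated_outputs)


# Example call (requires LangCheck):
# result = sentiment_via_shortcut(["I love this!", "This is terrible."])
# print(result.metric_values)
```

The returned object is a MetricValue, so the per-output scores are available through its metric_values attribute.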

There are several different types of metrics:

Type of Metric	Examples	Languages
Reference-Free Text Quality Metrics	`toxicity(generated_outputs)` `sentiment(generated_outputs)` `ai_disclaimer_similarity(generated_outputs)`	EN, JA, DE, ZH
Reference-Based Text Quality Metrics	`semantic_similarity(generated_outputs, reference_outputs)` `rouge2(generated_outputs, reference_outputs)`	EN, JA, DE, ZH
Source-Based Text Quality Metrics	`factual_consistency(generated_outputs, sources)`	EN, JA, DE, ZH
Text Structure Metrics	`is_float(generated_outputs, min=0, max=None)` `is_json_object(generated_outputs)`	All Languages
Pairwise Text Quality Metrics	`pairwise_comparison(generated_outputs_a, generated_outputs_b, prompts)`	EN

class langcheck.metrics.MetricValue

Bases: Generic[NumericType]

A rich object that is the output of any langcheck.metrics function.

all() → bool: Equivalent to all(metric_value.metric_values). This is mostly useful for binary metric functions.

any() → bool: Equivalent to any(metric_value.metric_values). This is mostly useful for binary metric functions.
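The semantics of all() and any() can be illustrated with a small stand-in class. This is a sketch only: `MetricValueSketch` and its `NumericType` definition are hypothetical and are not LangCheck's own MetricValue; they only mirror the documented equivalence to the built-ins applied to metric_values.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

# Assumption: a numeric type variable, defined here only for the sketch.
NumericType = TypeVar("NumericType", int, float)


@dataclass
class MetricValueSketch(Generic[NumericType]):
    """Hypothetical stand-in holding only the fields needed for the demo."""

    metric_name: str
    metric_values: List[NumericType]

    def all(self) -> bool:
        # Mirrors the documented behavior: all(metric_value.metric_values)
        return all(self.metric_values)

    def any(self) -> bool:
        # Mirrors the documented behavior: any(metric_value.metric_values)
        return any(self.metric_values)


# A binary metric: 1 means the check passed, 0 means it failed.
binary = MetricValueSketch("is_json_object", [1, 1, 0])
passed_everywhere = binary.all()  # False, because one output failed
passed_somewhere = binary.any()   # True, because at least one output passed
```

For continuous scores the result depends only on whether any value is exactly zero, which is why these helpers are mostly useful for binary metrics.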

explanations: List[str | None] | None

generated_outputs: List[str] | tuple[List[str], List[str]] | None

histogram(jupyter_mode: str = 'inline') → None

Shows an interactive histogram of all data points in MetricValue. Intended to be used in a Jupyter notebook.

This is a convenience function that calls langcheck.plot.histogram().

property is_pairwise: bool

language: str | None

metric_name: str

metric_values: List[NumericType]

prompts: List[str] | None

reference_outputs: List[str] | None

scatter(jupyter_mode: str = 'inline') → None

Shows an interactive scatter plot of all data points in MetricValue. Intended to be used in a Jupyter notebook.

This is a convenience function that calls langcheck.plot.scatter().

sources: List[str] | tuple[List[str] | None, List[str] | None] | None

to_df() → DataFrame: Returns a DataFrame of metric values for each data point.
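As a rough sketch of how the attributes above might be read in practice, the following helper summarizes any object that exposes metric_name, metric_values, and language. The function name `summarize` is hypothetical, and the sketch deliberately avoids depending on LangCheck itself.

```python
def summarize(metric_value):
    """Return a small summary dict for a MetricValue-like object.

    Hypothetical helper: it only reads the documented attributes
    metric_name, metric_values, and language.
    """
    values = metric_value.metric_values
    return {
        "metric_name": metric_value.metric_name,
        "language": metric_value.language,
        "count": len(values),
        # Guard against an empty list to avoid dividing by zero.
        "mean": sum(values) / len(values) if values else None,
    }
```

For per-row analysis, to_df() is likely the more convenient entry point, since it returns the metric values for each data point as a DataFrame.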

langcheck.metrics.ai_disclaimer_similarity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, ai_disclaimer_phrase: str = "I don't have personal opinions, emotions, or consciousness.", eval_model: str | EvalClient = 'local') → MetricValue[float]

Calculates the degree to which the LLM’s output contains a disclaimer that it is an AI. This is calculated by computing the semantic similarity between the generated outputs and a reference AI disclaimer phrase; by default, this phrase is “I don’t have personal opinions, emotions, or consciousness.”, but you can also pass in a custom phrase. Please refer to semantic_similarity() for details on the typical output ranges and the supported embedding model types.

Parameters:

generated_outputs – The model-generated output(s) to evaluate.
prompts – The prompt(s) used to generate the outputs (optional). Prompts are not evaluated and are used only as metadata.
ai_disclaimer_phrase – The reference AI disclaimer phrase. Defaults to “I don’t have personal opinions, emotions, or consciousness.”
eval_model – The type of model to use: ‘local’, or the EvalClient instance used for the evaluation. Defaults to ‘local’.

Returns:

A MetricValue[float] object
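A hedged usage sketch for ai_disclaimer_similarity(), assuming LangCheck is installed and the default eval_model of 'local' is acceptable. The wrapper name `disclaimer_scores` and its `phrase` argument are hypothetical; only the library call and its ai_disclaimer_phrase parameter come from the signature above.

```python
def disclaimer_scores(generated_outputs, phrase=None):
    """Return per-output similarity to an AI disclaimer phrase.

    Hypothetical wrapper around langcheck.metrics.ai_disclaimer_similarity().
    """
    # Imported lazily so this sketch can be loaded without LangCheck installed.
    import langcheck.metrics

    if phrase is None:
        # Fall back to the library's documented default disclaimer phrase.
        result = langcheck.metrics.ai_disclaimer_similarity(generated_outputs)
    else:
        # Compare against a custom reference phrase instead.
        result = langcheck.metrics.ai_disclaimer_similarity(
            generated_outputs, ai_disclaimer_phrase=phrase
        )
    # The function returns a MetricValue; expose just the per-output scores.
    return result.metric_values


# Example call (requires LangCheck):
# scores = disclaimer_scores(["As an AI, I don't have opinions on that."])
```

Because the scores are semantic similarities, they are best interpreted using the output ranges described for semantic_similarity().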
