langcheck.metrics.en
Tip
As a shortcut, all English metrics are also directly accessible from langcheck.metrics. For example, you can directly import langcheck.metrics.sentiment instead of langcheck.metrics.en.sentiment.
- langcheck.metrics.en.ai_disclaimer_similarity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, ai_disclaimer_phrase: str = "I don't have personal opinions, emotions, or consciousness.", openai_client: OpenAI | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]
Calculates the degree to which the LLM’s output contains a disclaimer that it is an AI. This is calculated by computing the semantic similarity between the generated outputs and a reference AI disclaimer phrase; by default, this phrase is “I don’t have personal opinions, emotions, or consciousness.”, but you can also pass in a custom phrase. Please refer to semantic_similarity() for details on the typical output ranges and the supported embedding model types.
- Parameters:
generated_outputs – A list of model generated outputs to evaluate
prompts – An optional list of prompts used to generate the outputs. Prompts are not evaluated and only used as metadata.
ai_disclaimer_phrase – Reference AI disclaimer phrase, default “I don’t have personal opinions, emotions, or consciousness.”
model_type – The type of embedding model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.embeddings.create function, default None
- Returns:
A MetricValue object
- langcheck.metrics.en.answer_relevance(generated_outputs: List[str] | str, prompts: List[str] | str, model_type: str = 'openai', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None) → MetricValue[float | None]
Calculates the relevance of generated outputs to the prompt. This metric takes on float values of either 0.0 (Not Relevant), 0.5 (Partially Relevant), or 1.0 (Fully Relevant). The score may also be None if it could not be computed.
We currently support two model types:
1. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.
2. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
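Because a score may be None when it could not be computed, any aggregation over the results has to skip those entries. The following is a minimal sketch of one way to do that, assuming the scores have already been copied into a plain Python list; the helper name mean_of_computed_scores is hypothetical and is not part of LangCheck.

```python
from typing import List, Optional


def mean_of_computed_scores(scores: List[Optional[float]]) -> Optional[float]:
    """Average the scores that were computed, skipping None entries.

    Hypothetical helper for illustration only; not part of LangCheck.
    """
    computed = [s for s in scores if s is not None]
    if not computed:
        return None  # nothing could be scored
    return sum(computed) / len(computed)


# Illustrative values in the shape described above: 0.0, 0.5, 1.0, or None.
example_scores = [1.0, 0.5, None, 0.0]
average = mean_of_computed_scores(example_scores)
```

Returning None when every entry is None keeps "no score available" distinct from a genuine average of 0.0.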
- langcheck.metrics.en.context_relevance(sources: List[str] | str, prompts: List[str] | str, model_type: str = 'openai', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None) → MetricValue[float | None]
Calculates the relevance of the sources to the prompts. This metric takes on float values between [0, 1], where 0 means that the source text is not at all relevant to the prompt, and 1 means that the source text is fully relevant to the prompt.
We currently support two model types:
1. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.
2. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
- Parameters:
sources – The source text(s), one string per prompt
prompts – The prompt(s)
model_type – The type of model to use (‘openai’ or ‘azure_openai’), default ‘openai’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
- langcheck.metrics.en.factual_consistency(generated_outputs: List[str] | str, sources: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None) → MetricValue[float | None]
Calculates the factual consistency between the generated outputs and the sources. This metric takes on float values between [0, 1], where 0 means that the output is not at all consistent with the source text, and 1 means that the output is fully consistent with the source text. (NOTE: when using the OpenAI model, the factuality scores are either 0.0, 0.5, or 1.0. The score may also be None if it could not be computed.)
We currently support three model types:
1. The ‘local’ type, where the ‘unieval-fact’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.
3. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
- Parameters:
generated_outputs – The model generated output(s) to evaluate
sources – The source text(s), one string per generated output
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
- Returns:
A MetricValue object
- langcheck.metrics.en.flesch_kincaid_grade(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]
Calculates the readability of generated outputs using the Flesch-Kincaid Grade Level metric. This metric takes on float values between [-3.40, ∞), but typically ranges between 0 and 12 (corresponding to U.S. grade levels), where lower scores mean the text is easier to read.
Like the Flesch Reading Ease Score, this metric is based on the number of sentences, words, and syllables in the text.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
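As a rough illustration of where the lower bound of -3.40 comes from, the sketch below applies the standard published Flesch-Kincaid Grade Level formula to pre-computed counts. How LangCheck itself counts sentences, words, and syllables is not described here, so the counts are passed in directly; the function name flesch_kincaid_grade_from_counts is hypothetical.

```python
def flesch_kincaid_grade_from_counts(
    num_sentences: int, num_words: int, num_syllables: int
) -> float:
    """Standard Flesch-Kincaid Grade Level formula applied to raw counts.

    Hypothetical helper for illustration only; not LangCheck's implementation.
    """
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# The shortest possible text (one sentence, one word, one syllable)
# gives the minimum value of the metric.
lowest_grade = flesch_kincaid_grade_from_counts(1, 1, 1)

# A made-up text with 2 sentences, 20 words, and 30 syllables.
sample_grade = flesch_kincaid_grade_from_counts(2, 20, 30)
```

Longer sentences and longer words both push the grade upward, which is why the metric has no upper bound.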
- langcheck.metrics.en.flesch_reading_ease(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]
Calculates the readability of generated outputs using the Flesch Reading Ease Score. This metric takes on float values between (-∞, 121.22], but typically ranges between 0 and 100, where higher scores mean the text is easier to read.
The score is based on the number of sentences, words, and syllables in the text. See “How to Write Plain English” by Rudolf Franz Flesch for more details.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
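The upper bound of 121.22 follows from the standard published Flesch Reading Ease formula, sketched below on pre-computed counts. This is only an illustration under the assumption that the counts are already known; the function name flesch_reading_ease_from_counts is hypothetical, and LangCheck's own sentence, word, and syllable counting is not shown.

```python
def flesch_reading_ease_from_counts(
    num_sentences: int, num_words: int, num_syllables: int
) -> float:
    """Standard Flesch Reading Ease formula applied to raw counts.

    Hypothetical helper for illustration only; not LangCheck's implementation.
    """
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# The shortest possible text (one sentence, one word, one syllable)
# gives the maximum value of the metric.
highest_score = flesch_reading_ease_from_counts(1, 1, 1)

# A made-up text with 2 sentences, 20 words, and 30 syllables.
sample_score = flesch_reading_ease_from_counts(2, 20, 30)
```

Both terms are subtracted, so longer sentences and longer words lower the score without limit.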
- langcheck.metrics.en.fluency(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None) → MetricValue[float | None]
Calculates the fluency scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low fluency and 1 is high fluency. (NOTE: when using the OpenAI model, the fluency scores are either 0.0 (poor), 0.5 (fair), or 1.0 (good). The score may also be None if it could not be computed.)
We currently support three model types:
1. The ‘local’ type, where the Parrot fluency model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.
3. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
- Returns:
A MetricValue object
- langcheck.metrics.en.rouge1(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]
Calculates the F1 metrics of the ROUGE-1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
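To make the unigram-overlap idea concrete, here is a minimal sketch of a ROUGE-1 F1 computation. It assumes lowercased whitespace tokenization, which may differ from the tokenization LangCheck uses internally; the function name rouge1_f1 is hypothetical.

```python
from collections import Counter


def rouge1_f1(generated: str, reference: str) -> float:
    """F1 of the unigram overlap between two texts.

    Illustrative sketch only: tokens are lowercased and split on whitespace,
    which is an assumption and not necessarily what LangCheck does.
    """
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Five of the six tokens on each side overlap, so precision = recall = 5/6.
rouge1_score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
```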
- langcheck.metrics.en.rouge2(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]
Calculates the F1 metrics of the ROUGE-2 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
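The bigram variant can be sketched the same way, counting pairs of adjacent tokens instead of single tokens. As before, this assumes lowercased whitespace tokenization, and the function name rouge2_f1 is hypothetical rather than part of LangCheck.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(text: str) -> List[Tuple[str, str]]:
    """Adjacent token pairs, using lowercased whitespace tokens (an assumption)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def rouge2_f1(generated: str, reference: str) -> float:
    """F1 of the bigram overlap between two texts (illustrative sketch only)."""
    gen_counts = Counter(bigrams(generated))
    ref_counts = Counter(bigrams(reference))
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Three of the five bigrams on each side overlap, so precision = recall = 3/5.
rouge2_score = rouge2_f1("the cat sat on the mat", "the cat is on the mat")
```

A single changed word breaks two bigrams, so ROUGE-2 drops faster than ROUGE-1 on the same pair of texts.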
- langcheck.metrics.en.rougeL(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]
Calculates the F1 metrics of the ROUGE-L scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
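The longest-common-subsequence idea can be sketched with a small dynamic program. This is an illustration under the assumption of lowercased whitespace tokenization, not LangCheck's implementation; the names lcs_length and rouge_l_f1 are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated: str, reference: str) -> float:
    """F1 based on the LCS length (illustrative sketch only)."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(gen_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The LCS is "the cat on the mat" (5 tokens) out of 6 tokens on each side.
rouge_l_score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Unlike the n-gram variants, the matched tokens do not need to be adjacent, only in the same order.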
- langcheck.metrics.en.semantic_similarity(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, Any] | None = None) → MetricValue[float]
Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity.
We currently support three embedding model types:
1. The ‘local’ type, where the ‘all-mpnet-base-v2’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘text-embedding-3-small’ model by default (this is configurable). See this page on setting up the OpenAI API key.
3. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
- Ref:
https://huggingface.co/tasks/sentence-similarity
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
https://openai.com/blog/new-embedding-models-and-api-updates
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of embedding model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.embeddings.create function, default None
- Returns:
A MetricValue object
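The cosine similarity underlying this metric, and its [-1, 1] range, can be illustrated with a short sketch. The two-dimensional vectors below are toy stand-ins for real embeddings, which would come from whichever embedding model is selected; the function name cosine_similarity is hypothetical and not LangCheck's API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length.

    Illustrative sketch only; not LangCheck's implementation.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumed values, not model output).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Only the direction of the vectors matters, not their length, which is why scaling one vector leaves the similarity unchanged.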
- langcheck.metrics.en.sentiment(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None) → MetricValue[float | None]
Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using the OpenAI model, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive). The score may also be None if it could not be computed.)
We currently support three model types:
1. The ‘local’ type, where the Twitter-roBERTa-base model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.
3. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
- Returns:
A MetricValue object
- langcheck.metrics.en.toxicity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None) → MetricValue[float | None]
Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity. (NOTE: when using the OpenAI model, the toxicity scores are in steps of 0.25. The score may also be None if it could not be computed.)
We currently support three model types:
1. The ‘local’ type, where the Detoxify model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.
3. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
- Returns:
A MetricValue object