langcheck.metrics.de.reference_free_text_quality

langcheck.metrics.de.reference_free_text_quality.ai_disclaimer_similarity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, ai_disclaimer_phrase: str = 'Ich habe keine persönlichen Meinungen, Emotionen oder Bewusstsein.', openai_client: OpenAI | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]

Calculates the degree to which the LLM’s output contains a disclaimer that it is an AI. The score is the semantic similarity between each generated output and a reference AI disclaimer phrase. By default, this phrase is “Ich habe keine persönlichen Meinungen, Emotionen oder Bewusstsein.” (the most common reply from ChatGPT in German), but you can also pass in a custom phrase. Please refer to semantic_similarity() for details on the typical output ranges and the supported embedding model types.

Parameters:

generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
ai_disclaimer_phrase – Reference AI disclaimer phrase, default “Ich habe keine persönlichen Meinungen, Emotionen oder Bewusstsein.” (German for “I don’t have personal opinions, emotions, or consciousness.”)
model_type – The type of embedding model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.embeddings.create function, default None

Returns:

A MetricValue object
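The similarity computation described above can be illustrated with a minimal sketch. It assumes the score is a cosine similarity between embedding vectors, which is a common choice for embedding-based similarity; the three-dimensional vectors are made up for illustration, and `cosine_similarity` is a hypothetical helper, not part of langcheck.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has no direction.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumption: real embeddings come from the chosen
# embedding model and have hundreds of dimensions).
disclaimer_vec = [1.0, 0.0, 0.0]
output_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# One score per generated output, compared against the disclaimer phrase.
scores = [cosine_similarity(v, disclaimer_vec) for v in output_vecs]
```

An output whose embedding points in the same direction as the disclaimer phrase scores high, while an unrelated output scores near zero.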

langcheck.metrics.de.reference_free_text_quality.answer_relevance(generated_outputs: List[str] | str, prompts: List[str] | str, model_type: str = 'openai', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None, *, use_async: bool = False) → MetricValue[float | None]

Calculates the relevance of generated outputs to the prompt. This metric takes on float values of either 0.0 (Not Relevant), 0.5 (Partially Relevant), or 1.0 (Fully Relevant). The score may also be None if it could not be computed.

We currently support two model types:

1. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.

2. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify the model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}.
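Because a score may be None when it could not be computed, downstream aggregation has to skip missing values. The following is a minimal sketch of one way to do that on a plain list of scores; `mean_ignoring_none` is a hypothetical helper and not part of langcheck.

```python
from typing import List, Optional


def mean_ignoring_none(scores: List[Optional[float]]) -> Optional[float]:
    """Average the scores that were computed; None if there are none."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Example scores: fully relevant, partially relevant, failed, not relevant.
example_scores = [1.0, 0.5, None, 0.0]
average = mean_ignoring_none(example_scores)
```

Here the None entry is excluded, so the average is taken over the three computed scores only.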

langcheck.metrics.de.reference_free_text_quality.flesch_kincaid_grade(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the readability of generated outputs using the Flesch-Kincaid Grade Level. The formula is the same as in English, though the resulting scores tend to be higher for German text. Ref: https://de.wikipedia.org/wiki/Lesbarkeitsindex#Flesch-Kincaid-Grade-Level
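As a rough illustration, the standard English Flesch-Kincaid Grade Level formula can be sketched as below. The sentence, word, and syllable counts are supplied by hand; how langcheck tokenizes German text and counts syllables is not reproduced here, and `flesch_kincaid_grade_level` is a hypothetical helper name.

```python
def flesch_kincaid_grade_level(
    num_sentences: int, num_words: int, num_syllables: int
) -> float:
    """Standard Flesch-Kincaid Grade Level from raw counts."""
    if num_sentences <= 0 or num_words <= 0:
        raise ValueError("need at least one sentence and one word")
    asl = num_words / num_sentences  # average sentence length in words
    asw = num_syllables / num_words  # average syllables per word
    return 0.39 * asl + 11.8 * asw - 15.59


# Hand-counted example: 2 sentences, 20 words, 30 syllables.
grade = flesch_kincaid_grade_level(num_sentences=2, num_words=20, num_syllables=30)
```

Longer sentences and longer words both raise the grade level, which is consistent with German text tending to score higher than comparable English text.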

langcheck.metrics.de.reference_free_text_quality.flesch_reading_ease(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]

Calculates the readability of generated outputs using the Flesch Reading Ease Score. This metric takes on float values between (-∞, 121.22], but typically ranges between 0 and 100, where higher scores mean the text is easier to read.

The score is based on the number of sentences, words, and syllables in the text. See “How to Write Plain English” by Rudolf Franz Flesch for more details. For the German formula, see https://de.wikipedia.org/wiki/Lesbarkeitsindex#Flesch-Reading-Ease:

FRE(Deutsch) = 180 - ASL - 58.5 * ASW

where ASL is the average sentence length (words per sentence) and ASW is the average number of syllables per word.

Parameters:

generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
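The German formula FRE(Deutsch) = 180 - ASL - 58.5 * ASW can be worked through with a short sketch. The counts are supplied by hand as an assumption; langcheck's own sentence splitting and syllable counting are not reproduced, and `flesch_reading_ease_de` is a hypothetical helper name.

```python
def flesch_reading_ease_de(
    num_sentences: int, num_words: int, num_syllables: int
) -> float:
    """German Flesch Reading Ease from raw counts."""
    if num_sentences <= 0 or num_words <= 0:
        raise ValueError("need at least one sentence and one word")
    asl = num_words / num_sentences  # ASL: average sentence length in words
    asw = num_syllables / num_words  # ASW: average syllables per word
    return 180.0 - asl - 58.5 * asw


# Hand-counted example: 2 sentences, 20 words, 40 syllables.
# ASL = 10 and ASW = 2, so the score is 180 - 10 - 117.
score = flesch_reading_ease_de(num_sentences=2, num_words=20, num_syllables=40)
```

Shorter sentences and shorter words both push the score up, which matches the reading that higher scores mean easier text.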

langcheck.metrics.de.reference_free_text_quality.fluency(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None, *, use_async: bool = False) → MetricValue[float | None]

Calculates the fluency scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low fluency and 1 is high fluency. (NOTE: when using the OpenAI model, the fluency scores are either 0.0 (poor), 0.5 (fair), or 1.0 (good). The score may also be None if it could not be computed.)

We currently support three model types:

1. The ‘local’ type, where we first translate the generated outputs to English and then apply the Parrot fluency model to the English translation. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.

3. The ‘azure_openai’ type. Essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify your model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}

Parameters:

generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
use_async – Whether to use the asynchronous API of OpenAI, default False

Returns:

A MetricValue object

langcheck.metrics.de.reference_free_text_quality.sentiment(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None, local_overflow_strategy: str = 'truncate', *, use_async: bool = False) → MetricValue[float | None]

Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using the OpenAI model, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive). The score may also be None if it could not be computed.)

We currently support three model types:

1. The ‘local’ type, where the twitter-xlm-roberta-base-sentiment-finetunned model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

Ref: https://huggingface.co/citizenlab/twitter-xlm-roberta-base-sentiment-finetunned

Parameters:

generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
local_overflow_strategy – The strategy for handling inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, outputs that are too long are assigned a score of None. If ‘truncate’, outputs that are too long are truncated. If ‘raise’, an error is raised when an output is too long. The default value is ‘truncate’.
use_async – Whether to use the asynchronous API of OpenAI, default False

Returns:

A MetricValue object
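The three local_overflow_strategy options can be sketched as follows. This is an illustration only: it works on a plain list of tokens with a hypothetical `max_len` limit, whereas the real local model uses its own tokenizer and length limit, and `apply_overflow_strategy` is not a langcheck function.

```python
from typing import List, Optional


def apply_overflow_strategy(
    tokens: List[str], max_len: int, strategy: str = "truncate"
) -> Optional[List[str]]:
    """Return the tokens to score, or None if the input is nullified."""
    if strategy not in ("nullify", "truncate", "raise"):
        raise ValueError(f"unsupported strategy: {strategy!r}")
    if len(tokens) <= max_len:
        return tokens  # short enough, no special handling needed
    if strategy == "truncate":
        return tokens[:max_len]  # keep only the first max_len tokens
    if strategy == "nullify":
        return None  # caller assigns a score of None to this input
    raise ValueError("input is too long for the model")  # strategy == "raise"


tokens = "Das ist ein sehr langer Satz".split()
truncated = apply_overflow_strategy(tokens, max_len=3, strategy="truncate")
nullified = apply_overflow_strategy(tokens, max_len=3, strategy="nullify")
```

With ‘truncate’ every input still receives a score, at the cost of ignoring the tail of long inputs; ‘nullify’ and ‘raise’ make overlong inputs visible instead.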

langcheck.metrics.de.reference_free_text_quality.toxicity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None, local_overflow_strategy: str = 'truncate', *, use_async: bool = False) → MetricValue[float | None]

Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity. (NOTE: when using the OpenAI model, the toxicity scores are in steps of 0.25. The score may also be None if it could not be computed.)

We currently support three model types:

1. The ‘local’ type, where the multilingual Detoxify model is downloaded from GitHub and run locally. This is the default model type and there is no setup needed to run this.

2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default, in the same way as the English counterpart. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.

Parameters:

generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’, ‘openai’, or ‘azure_openai’), default ‘local’
openai_client – OpenAI or AzureOpenAI client, default None. If this is None but model_type is ‘openai’ or ‘azure_openai’, we will attempt to create a default client.
openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None
local_overflow_strategy – The strategy for handling inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, outputs that are too long are assigned a score of None. If ‘truncate’, outputs that are too long are truncated. If ‘raise’, an error is raised when an output is too long. The default value is ‘truncate’.
use_async – Whether to use the asynchronous API of OpenAI, default False

Returns:

A MetricValue object
