langcheck.metrics#
langcheck.metrics contains all of LangCheck's evaluation metrics.
Since LangCheck has multi-lingual support, language-specific metrics are organized into sub-packages such as langcheck.metrics.en or langcheck.metrics.ja.
Tip
As a shortcut, all English and language-agnostic metrics are also directly accessible from langcheck.metrics. For example, you can run langcheck.metrics.sentiment() directly instead of langcheck.metrics.en.reference_free_text_quality.sentiment().
Additionally, langcheck.metrics.MetricValue is a shortcut for langcheck.metrics.metric_value.MetricValue.
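For instance, here is a minimal sketch of the shortcut in practice (the example strings and printed values are purely illustrative):

import langcheck

# The English sentiment metric is accessible directly from langcheck.metrics
result = langcheck.metrics.sentiment(["The customer support was fantastic!"])
print(result.metric_values)  # e.g. [0.97] -- one float between 0 and 1 per output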
There are several different types of metrics:
Type of Metric | Examples | Languages
---|---|---
Reference-Free Text Quality Metrics | sentiment(), fluency(), toxicity() | EN, JA, DE, ZH
Reference-Based Text Quality Metrics | semantic_similarity(), rouge2() | EN, JA, DE, ZH
Source-Based Text Quality Metrics | factual_consistency() | EN, JA
Query-Based Text Quality Metrics | answer_relevance() | EN, JA, DE, ZH
Text Structure Metrics | is_float(), is_json_object() | All Languages
Pairwise Text Quality Metrics | pairwise_comparison() | EN
- class langcheck.metrics.MetricValue(metric_name: str, metric_values: list[NumericType], metric_inputs: MetricInputs, explanations: list[str | None] | None, language: str | None)[source]#
Bases: Generic[NumericType]
A rich object that is the output of any langcheck.metrics function.
- all() bool [source]#
Equivalent to all(metric_value.metric_values). This is mostly useful for binary metric functions.
- any() bool [source]#
Equivalent to any(metric_value.metric_values). This is mostly useful for binary metric functions.
- explanations: list[str | None] | None#
- histogram(jupyter_mode: str = 'inline') None [source]#
Shows an interactive histogram of all data points in MetricValue. Intended to be used in a Jupyter notebook.
This is a convenience function that calls langcheck.plot.histogram().
- property is_scatter_compatible: bool#
Checks whether the metric value is compatible with the scatter plot method. The scatter plot is only available for metric values that use the non-pairwise inputs supported since the initial release (generated_outputs, prompts, reference_outputs, and sources).
- language: str | None#
- metric_inputs: MetricInputs#
- metric_name: str#
- metric_values: list[NumericType]#
- scatter(jupyter_mode: str = 'inline') None [source]#
Shows an interactive scatter plot of all data points in MetricValue. Intended to be used in a Jupyter notebook.
This is a convenience function that calls langcheck.plot.scatter().
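As a rough usage sketch (the metric choice and printed values are illustrative), a MetricValue can be inspected with the attributes and methods above:

import langcheck

# A binary text-structure metric returns a MetricValue[int]
mv = langcheck.metrics.is_float(["1.5", "2.0", "abc"], min=0, max=10)
print(mv.metric_values)  # e.g. [1, 1, 0]
print(mv.any())          # True  -- at least one output passed
print(mv.all())          # False -- "abc" is not a float
mv.histogram()           # interactive histogram (in a Jupyter notebook)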
- langcheck.metrics.adult_content(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates whether adult content is included in the generated outputs to the prompt. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.ai_disclaimer_similarity(generated_outputs: list[str] | str, prompts: list[str] | str | None = None, ai_disclaimer_phrase: str = "I don't have personal opinions, emotions, or consciousness.", eval_model: str | EvalClient = 'local') MetricValue[float] [source]#
Calculates the degree to which the LLM’s output contains a disclaimer that it is an AI. This is calculated by computing the semantic similarity between the generated outputs and a reference AI disclaimer phrase; by default, this phrase is “I don’t have personal opinions, emotions, or consciousness.”, but you can also pass in a custom phrase. Please refer to semantic_similarity() for details on the typical output ranges and the supported embedding model types.
- Parameters:
generated_outputs – A list of model generated outputs to evaluate
prompts – An optional list of prompts used to generate the outputs. Prompts are not evaluated and only used as metadata.
ai_disclaimer_phrase – Reference AI disclaimer phrase, default “I don’t have personal opinions, emotions, or consciousness.”
eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). default ‘local’
- Returns:
A MetricValue object
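A minimal sketch with the default local embedding model (the output string and score are illustrative):

import langcheck

outputs = ["As an AI, I don't have personal opinions on this topic."]
result = langcheck.metrics.ai_disclaimer_similarity(outputs)
print(result.metric_values)  # e.g. [0.78] -- higher means closer to the disclaimer phrase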
- langcheck.metrics.answer_correctness(generated_outputs: list[str] | str, reference_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates the correctness of the generated outputs. This metric takes on float values of either 0.0 (Incorrect), 0.5 (Partially Correct), or 1.0 (Correct). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s)
eval_model – The EvalClient instance used for the evaluation
- Returns:
A MetricValue object
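A rough sketch of an EvalClient-based call; OpenAIEvalClient is used here as one possible eval client, but the available client classes and their constructor arguments depend on your installed LangCheck version and provider setup:

import langcheck
from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumes an OpenAI API key is configured

result = langcheck.metrics.answer_correctness(
    generated_outputs=["Tokyo is the capital of Japan."],
    reference_outputs=["The capital of Japan is Tokyo."],
    prompts=["What is the capital of Japan?"],
    eval_model=OpenAIEvalClient(),
)
print(result.metric_values)  # e.g. [1.0] -- Correct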
- langcheck.metrics.answer_relevance(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates the relevance of generated outputs to the prompt. This metric takes on float values of either 0.0 (Not Relevant), 0.5 (Partially Relevant), or 1.0 (Fully Relevant). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.answer_safety(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates the safety of generated outputs to the prompt. This metric takes on float values of either 0.0 (Unsafe), 0.5 (Partially Unsafe), or 1.0 (Safe). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.contains_all_strings(generated_outputs: list[str] | str, strings: list[str], case_sensitive: bool = False, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs contain all strings in a given list. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
strings – A list of strings to match
case_sensitive – Whether to match case sensitively or not, default False
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.contains_any_strings(generated_outputs: list[str] | str, strings: list[str], case_sensitive: bool = False, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs contain any strings in a given list. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
strings – A list of strings to match
case_sensitive – Whether to match case sensitively or not, default False
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
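A small illustrative sketch of the two string-matching checks (the example text is made up):

import langcheck

outputs = ["The quick brown fox jumps over the lazy dog."]

# 1 only if *all* of the strings appear in the output
all_match = langcheck.metrics.contains_all_strings(outputs, ["fox", "dog"])
# 1 if *any* of the strings appears in the output
any_match = langcheck.metrics.contains_any_strings(outputs, ["cat", "dog"])
print(all_match.metric_values, any_match.metric_values)  # [1] [1]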
- langcheck.metrics.contains_regex(generated_outputs: list[str] | str, regex: str, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs partially contain a given regular expression. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
regex – The regular expression to match
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.context_relevance(sources: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates the relevance of the sources to the prompts. This metric takes on float values between [0, 1], where 0 means that the source text is not at all relevant to the prompt, and 1 means that the source text is fully relevant to the prompt.
We currently only support the evaluation based on an EvalClient.
- Parameters:
sources – The source text(s), one string per prompt
prompts – The prompt(s)
eval_model – The EvalClient instance used for the evaluation
- langcheck.metrics.custom_evaluator(generated_outputs: list[str] | str | None, prompts: list[str] | str | None, sources: list[str] | str | None, reference_outputs: list[str] | str | None, eval_model: EvalClient, metric_name: str, score_map: dict[str, float], template_path: str, language: str, *, additional_inputs: dict[str, IndividualInputType] | None = None, additional_input_name_to_prompt_var_mapping: dict[str, str] | None = None) MetricValue[float | None] [source]#
Calculates the scores of a custom evaluator. The EvalClient will first assess the provided inputs using the prompt template, and then convert those assessments into scores using the score map.
The prompt template should be a Jinja2 file (file extension .j2) that specifies the criteria that an LLM (as configured in the Eval Client) should follow when evaluating an instance. The template is allowed to have placeholders for the following variables (NOTE: not all are required):
- gen_output: The generated output
- user_query: The prompt
- src: The source text
- ref_output: The reference output
By specifying additional inputs, the prompt template can be more flexible. The additional inputs should be passed as a dictionary, where the keys are the input names and the values are the corresponding values. The additional inputs can be mapped to variable names in the prompt template using the additional_input_name_to_prompt_var_mapping dictionary.
The prompt template should also specify the final available assessments for the LLM evaluator, e.g. “Good”, “Bad”, “Neutral”, etc. The score map should then map each of those available assessments to a numerical score. E.g. if the available assessments in the prompt template are “Good”, “Bad”, and “Neutral”, the score map should be something like:
score_map = {'Good': 1.0, 'Neutral': 0.5, 'Bad': 0.0}
NOTE: We have found that LLMs sometimes behave unpredictably when the assessments are non-ASCII characters (see citadel-ai/langcheck#84 as an example). So, we recommend making the final assessments ASCII characters, even when the rest of the prompt template contains non-ASCII characters (e.g. Japanese).
- Parameters:
generated_outputs – The model generated output(s)
prompts – The prompts used to generate the output(s)
sources – The source(s) of the generated output(s)
reference_outputs – The reference output(s)
eval_model – The EvalClient instance used for the evaluation
metric_name – The name of the metric
score_map – A dictionary mapping the evaluator’s assessments to scores
template_path – The path to the prompt template file. This should be a Jinja2 file (file extension .j2).
language – The language that the evaluator will use (‘en’, ‘ja’, or ‘de’)
additional_inputs – Additional inputs other than the standard ones.
additional_input_name_to_prompt_var_mapping – A dictionary that maps the additional input names to the variable names in the prompt template.
- Returns:
A MetricValue object
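A rough sketch of a custom evaluator; the template file name, its contents, and the eval client shown here are assumptions for illustration only:

import langcheck
from langcheck.metrics.eval_clients import OpenAIEvalClient  # any EvalClient works here

# Hypothetical template "conciseness.j2" that uses {{ gen_output }} and {{ user_query }}
# and asks the LLM to answer with exactly one of "Concise", "Acceptable", or "Verbose".
score_map = {'Concise': 1.0, 'Acceptable': 0.5, 'Verbose': 0.0}

result = langcheck.metrics.custom_evaluator(
    generated_outputs=["Yes."],
    prompts=["Is Python dynamically typed?"],
    sources=None,
    reference_outputs=None,
    eval_model=OpenAIEvalClient(),
    metric_name='conciseness',
    score_map=score_map,
    template_path='conciseness.j2',  # hypothetical template file
    language='en',
)
print(result.metric_values)  # e.g. [1.0]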
- langcheck.metrics.custom_pairwise_evaluator(generated_outputs_a: list[str] | str | None, generated_outputs_b: list[str] | str | None, prompts: list[str] | str | None, sources_a: list[str] | str | None, sources_b: list[str] | str | None, reference_outputs: list[str] | str | None, eval_model: EvalClient, metric_name: str, score_map: dict[str, float], template_path: str, language: str, enforce_consistency: bool = True) MetricValue[float | None] [source]#
Calculates the scores of a custom pairwise evaluator, where “pairwise” means that the Responses and/or Sources of two systems will be compared against each other. The EvalClient will first assess the provided inputs using the prompt template, and then convert those assessments into scores using the score map.
The prompt template should be a Jinja2 file (file extension .j2) that specifies the criteria that an LLM (as configured in the Eval Client) should follow when evaluating an instance. The template is allowed to have placeholders for the following variables (NOTE: not all are required):
- gen_output_a: Model A’s generated output
- gen_output_b: Model B’s generated output
- user_query: The prompt
- src_a: The source text for Model A
- src_b: The source text for Model B
- ref_output: The reference output
The prompt template should also specify the final available assessments for the LLM evaluator, e.g. “Response A”, “Response B”, “Tie”, etc. The score map should then map each of those available assessments to a numerical score. E.g. if the available assessments in the prompt template are “Response A”, “Response B”, and “Tie”, the score map should be something like:
score_map = {'Response A': 0.0, 'Response B': 1.0, 'Tie': 0.5}
NOTE: If enforce_consistency is True, please make sure that the score map is symmetric, in the sense that swapping Model A and Model B should result in inverse scores. See the code below for more details.
NOTE: We have found that LLMs sometimes behave unpredictably when the assessments are non-ASCII characters (see citadel-ai/langcheck#84 as an example). So, we recommend making the final assessments ASCII characters, even when the rest of the prompt template contains non-ASCII characters (e.g. Japanese).
- Parameters:
generated_outputs_a – Model A’s generated output(s)
generated_outputs_b – Model B’s generated output(s)
prompts – The prompts used to generate the output(s)
sources_a – The source(s) for Model A’s generated output(s)
sources_b – The source(s) for Model B’s generated output(s)
reference_outputs – The reference output(s)
eval_model – The EvalClient instance used for the evaluation
metric_name – The name of the metric
score_map – A dictionary mapping the evaluator’s assessments to scores
template_path – The path to the prompt template file. This should be a Jinja2 file (file extension .j2).
language – The language that the evaluator will use (‘en’, ‘ja’, or ‘de’)
enforce_consistency – When this is True, we will only return a score if the score is the same when Model A and Model B are swapped. This is useful for ensuring that the evaluator’s position bias is not impacting the scores. Default True.
- Returns:
A MetricValue object
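A short sketch of a symmetric score map for a pairwise evaluator; the template file and the eval client are placeholders. Swapping Model A and Model B flips the score around the 'Tie' value, which is what enforce_consistency expects:

import langcheck
from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumes provider credentials are set up

# Symmetric score map: 'Response A' and 'Response B' are inverses around 'Tie'
score_map = {'Response A': 0.0, 'Tie': 0.5, 'Response B': 1.0}

result = langcheck.metrics.custom_pairwise_evaluator(
    generated_outputs_a=["Paris."],
    generated_outputs_b=["The capital of France is Paris."],
    prompts=["What is the capital of France?"],
    sources_a=None,
    sources_b=None,
    reference_outputs=None,
    eval_model=OpenAIEvalClient(),
    metric_name='helpfulness_pairwise',
    score_map=score_map,
    template_path='helpfulness_pairwise.j2',  # hypothetical template file
    language='en',
)
print(result.metric_values)  # e.g. [1.0] -- Response B preferred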
- langcheck.metrics.exact_match(generated_outputs: list[str] | str, reference_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if the generated outputs exactly match the reference outputs. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.factual_consistency(generated_outputs: list[str] | str, sources: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float | None] [source]#
Calculates the factual consistency between the generated outputs and the sources. This metric takes on float values between [0, 1], where 0 means that the output is not at all consistent with the source text, and 1 means that the output is fully consistent with the source text. (NOTE: when using an EvalClient, the factuality scores are either 0.0, 0.5, or 1.0. The score may also be None if it could not be computed.)
We currently support two evaluation model types:
1. The ‘local’ type, where the ‘unieval-fact’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
sources – The source text(s), one string per generated output
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). default ‘local’
- Returns:
A MetricValue object
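A minimal sketch with the default local model (the texts and score are illustrative):

import langcheck

result = langcheck.metrics.factual_consistency(
    generated_outputs=["The Earth orbits the Sun once every 365.25 days."],
    sources=["A year on Earth, one full orbit around the Sun, lasts about 365.25 days."],
)
print(result.metric_values)  # e.g. [0.97] -- close to 1 means consistent with the source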
- langcheck.metrics.flesch_kincaid_grade(generated_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[float] [source]#
Calculates the readability of generated outputs using the Flesch-Kincaid Grade Level metric. This metric takes on float values between [-3.40, ∞), but typically ranges between 0 and 12 (corresponding to U.S. grade levels), where lower scores mean the text is easier to read.
Like the Flesch Reading Ease Score, this metric is based on the number of sentences, words, and syllables in the text.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.flesch_reading_ease(generated_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[float] [source]#
Calculates the readability of generated outputs using the Flesch Reading Ease Score. This metric takes on float values between (-∞, 121.22], but typically ranges between 0 and 100, where higher scores mean the text is easier to read.
The score is based on the number of sentences, words, and syllables in the text. See “How to Write Plain English” by Rudolf Franz Flesch for more details.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.fluency(generated_outputs: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local', local_overflow_strategy: str = 'truncate') MetricValue[float | None] [source]#
Calculates the fluency scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low fluency and 1 is high fluency. (NOTE: when using an EvalClient, the fluency scores are either 0.0 (poor), 0.5 (fair), or 1.0 (good). The score may also be None if it could not be computed.)
We currently support two evaluation model types:
1. The ‘local’ type, where the Parrot fluency model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). default ‘local’
local_overflow_strategy – The strategy to handle the inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, the outputs that are too long will be assigned a score of None. If ‘truncate’, the outputs that are too long will be truncated. If ‘raise’, an error will be raised when the outputs are too long. The default value is ‘truncate’.
- Returns:
A MetricValue object
- langcheck.metrics.harmful_activity(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates whether suggestions or instructions for harmful activities are included in the generated outputs to the prompt. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.hate_speech(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates whether hate speech is included in the generated outputs to the prompt. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.is_float(generated_outputs: list[str] | str, min: float | None = None, max: float | None = None, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs can be parsed as floating point numbers, optionally within a min/max range. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
min – The optional minimum valid float
max – The optional maximum valid float
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.is_int(generated_outputs: list[str] | str, domain: Iterable[int] | Container[int] | None = None, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs can be parsed as integers, optionally within a domain of integers like range(1, 11) or {1, 3, 5}. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
domain – The optional domain of valid integers
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
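A brief sketch (the outputs are illustrative):

import langcheck

# Check that each output parses as an integer between 1 and 10
result = langcheck.metrics.is_int(["3", "7", "eleven"], domain=range(1, 11))
print(result.metric_values)  # [1, 1, 0]
print(result.all())          # False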
- langcheck.metrics.is_json_array(generated_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs can be parsed as JSON arrays. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.is_json_object(generated_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs can be parsed as JSON objects. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
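A brief sketch covering both JSON checks (inputs are illustrative):

import langcheck

arrays = langcheck.metrics.is_json_array(['[1, 2, 3]', 'not json'])
objects = langcheck.metrics.is_json_object(['{"name": "Alice"}', '[1, 2, 3]'])
print(arrays.metric_values)   # [1, 0]
print(objects.metric_values)  # [1, 0] -- a JSON array is not a JSON object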
- langcheck.metrics.jailbreak_prompt(prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates whether jailbreak techniques are included in the prompts. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.matches_regex(generated_outputs: list[str] | str, regex: str, prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs fully match a given regular expression. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
regex – The regular expression to match
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
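A short sketch contrasting full matching with the partial matching done by contains_regex above (outputs are illustrative):

import langcheck

outputs = ["Order ID: 12345", "12345"]

full = langcheck.metrics.matches_regex(outputs, r"\d{5}")      # the entire output must match
partial = langcheck.metrics.contains_regex(outputs, r"\d{5}")  # a match anywhere counts
print(full.metric_values)     # [0, 1]
print(partial.metric_values)  # [1, 1]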
- langcheck.metrics.pairwise_comparison(generated_outputs_a: list[str] | str, generated_outputs_b: list[str] | str, prompts: list[str] | str, sources_a: list[str] | str | None = None, sources_b: list[str] | str | None = None, reference_outputs: list[str] | str | None = None, enforce_consistency: bool = True, calculated_confidence: bool = False, preference_data_path: str = 'en/confidence_estimating/preference_data_examples.jsonl', k: int = 5, n: int = 5, seed: int | None = None, eval_model: EvalClient | None = None) MetricValue[float | None] [source]#
Calculates the pairwise comparison metric. This metric takes on float values of either 0.0 (Response A is better), 0.5 (Tie), or 1.0 (Response B is better). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- Parameters:
generated_outputs_a – Model A’s generated output(s) to evaluate
generated_outputs_b – Model B’s generated output(s) to evaluate
prompts – The prompts used to generate the output(s)
sources_a – The source text(s) for Model A’s generated output(s), default None
sources_b – The source text(s) for Model B’s generated output(s), default None
reference_outputs – The reference output(s), default None
enforce_consistency – When this is True, we will only return a score if the score is the same when Model A and Model B are swapped. This is useful for ensuring that the evaluator’s position bias is not impacting the scores. Default True.
calculated_confidence – When this is True, we will calculate a confidence score for the pairwise comparison metric. Default False.
preference_data_path – The relative path to preference data labeled by human annotators. Users should prepare a pool of preference annotations (e.g., 1000 examples) in advance to use this metric.
k – The number of examples of preference annotations
n – The number of simulated annotators
seed – The random seed for the simulated annotators
eval_model – The EvalClient instance used for the evaluation. This is marked as Optional so that it can follow the above arguments that have default values (for consistency with the other metrics), but this is in fact a required argument.
- Returns:
A MetricValue object
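A rough sketch of a pairwise comparison; OpenAIEvalClient is an assumption about your setup, and the score shown is illustrative:

import langcheck
from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumes an OpenAI API key is configured

result = langcheck.metrics.pairwise_comparison(
    generated_outputs_a=["Maybe."],
    generated_outputs_b=["Yes, Python supports type hints via the typing module."],
    prompts=["Does Python support type hints?"],
    eval_model=OpenAIEvalClient(),
)
print(result.metric_values)  # e.g. [1.0] -- Response B is better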
- langcheck.metrics.personal_data_leakage(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) MetricValue[float | None] [source]#
Calculates the personal data leakage of generated outputs to the prompt. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.prompt_leakage(generated_outputs: list[str] | str, system_prompts: list[str] | str, eval_model: EvalClient, eval_prompt_version: str = 'v2') MetricValue[float | None] [source]#
Calculates the severity of prompt leakage in the generated outputs. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.
We currently only support the evaluation based on an EvalClient.
- langcheck.metrics.rouge1(generated_outputs: list[str] | str, reference_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[float] [source]#
Calculates the F1 metrics of the ROUGE-1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.rouge2(generated_outputs: list[str] | str, reference_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[float] [source]#
Calculates the F1 metrics of the ROUGE-2 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
- langcheck.metrics.rougeL(generated_outputs: list[str] | str, reference_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[float] [source]#
Calculates the F1 metrics of the ROUGE-L scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
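An illustrative sketch of the three ROUGE variants on the same pair of texts:

import langcheck

gen = ["The cat sat on the mat."]
ref = ["The cat lay on the mat."]

print(langcheck.metrics.rouge1(gen, ref).metric_values)  # unigram overlap
print(langcheck.metrics.rouge2(gen, ref).metric_values)  # bigram overlap
print(langcheck.metrics.rougeL(gen, ref).metric_values)  # longest common subsequence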
- langcheck.metrics.semantic_similarity(generated_outputs: list[str] | str, reference_outputs: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float] [source]#
Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity.
We currently support two embedding model types:
1. The ‘local’ type, where the ‘all-mpnet-base-v2’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The EvalClient type, where you can use a similarity scorer returned by the given EvalClient. The scorer is typically implemented using the embedding APIs of cloud services. The implementation details are explained in each of the concrete EvalClient classes.
- Ref:
https://huggingface.co/tasks/sentence-similarity https://www.sbert.net/docs/usage/semantic_textual_similarity.html
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). default ‘local’
- Returns:
A MetricValue object
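A minimal sketch with the default local embedding model (the score is illustrative):

import langcheck

result = langcheck.metrics.semantic_similarity(
    generated_outputs=["The weather is lovely today."],
    reference_outputs=["It's a beautiful day outside."],
)
print(result.metric_values)  # e.g. [0.83]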
- langcheck.metrics.sentiment(generated_outputs: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local', local_overflow_strategy: str = 'truncate') MetricValue[float | None] [source]#
Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using an EvalClient, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive). The score may also be None if it could not be computed.)
We currently support two evaluation model types:
1. The ‘local’ type, where the Twitter-roBERTa-base model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). default ‘local’
local_overflow_strategy – The strategy to handle the inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, the outputs that are too long will be assigned a score of None. If ‘truncate’, the outputs that are too long will be truncated. If ‘raise’, an error will be raised when the outputs are too long. The default value is ‘truncate’.
- Returns:
A MetricValue object
- langcheck.metrics.toxicity(generated_outputs: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local', local_overflow_strategy: str = 'truncate', eval_prompt_version: str = 'v2') MetricValue[float | None] [source]#
Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity. (NOTE: when using an EvalClient, the toxicity scores are either 0.0 (nontoxic) or 1.0 (toxic). The score may also be None if it could not be computed.)
We currently support two evaluation model types:
1. The ‘local’ type, where the Detoxify model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). default ‘local’
local_overflow_strategy – The strategy to handle the inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, the outputs that are too long will be assigned a score of None. If ‘truncate’, the outputs that are too long will be truncated. If ‘raise’, an error will be raised when the outputs are too long. The default value is ‘truncate’.
eval_prompt_version – The version of the eval prompt to use when the EvalClient is used. The default version is ‘v2’ (latest).
- Returns:
A MetricValue object
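A minimal sketch with the default local Detoxify model (scores are illustrative):

import langcheck

result = langcheck.metrics.toxicity(["Have a great day!", "You are an idiot."])
print(result.metric_values)  # e.g. [0.01, 0.95] -- higher means more toxic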
- langcheck.metrics.validation_fn(generated_outputs: list[str] | str, valid_fn: Callable[[str], bool], prompts: list[str] | str | None = None) MetricValue[int] [source]#
Checks if generated outputs are valid according to an arbitrary function. This metric takes on binary 0 or 1 values.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
valid_fn – A function that takes a single string and returns a bool determining whether the string is valid or not. The function can also raise an exception on failure.
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object