langcheck.metrics.ja.reference_based_text_quality#

langcheck.metrics.ja.reference_based_text_quality.answer_correctness(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str, eval_model: EvalClient) MetricValue[float | None][source]#

Calculates the correctness of the generated outputs. This metric takes on one of three float values: 0.0 (Incorrect), 0.5 (Partially Correct), or 1.0 (Correct). The score may also be None if it could not be computed.

We currently only support evaluation with an EvalClient.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s)

  • eval_model – The EvalClient instance used for the evaluation

Returns:

A MetricValue object
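
Example (a minimal sketch; the concrete EvalClient used below is an assumption, so substitute whichever EvalClient you have configured):

from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumed client; any EvalClient works
from langcheck.metrics.ja.reference_based_text_quality import answer_correctness

# Assumes the OpenAI API key is already configured in your environment.
eval_client = OpenAIEvalClient()

result = answer_correctness(
    generated_outputs="富士山は日本で一番高い山です。",
    reference_outputs="日本で最も高い山は富士山です。",
    prompts="日本で一番高い山は何ですか？",
    eval_model=eval_client,
)
print(result)  # each score is 0.0, 0.5, 1.0, or None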

langcheck.metrics.ja.reference_based_text_quality.rouge1(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
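
Example (a minimal sketch with illustrative Japanese strings; no tokenizer is passed, so the default Japanese tokenizer is used):

from langcheck.metrics.ja.reference_based_text_quality import rouge1

result = rouge1(
    generated_outputs="東京は日本の首都です。",
    reference_outputs="日本の首都は東京です。",
)
# The unigram-overlap F1 score(s), each in [0, 1].
print(result.metric_values)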

langcheck.metrics.ja.reference_based_text_quality.rouge2(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-2 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
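
Example (a minimal sketch showing LangCheck's threshold-assertion pattern; the 0.1 threshold is arbitrary and chosen only for illustration):

from langcheck.metrics.ja.reference_based_text_quality import rouge2

generated_outputs = ["東京は日本の首都です。", "富士山は日本一高い山です。"]
reference_outputs = ["日本の首都は東京です。", "日本で最も高い山は富士山です。"]

result = rouge2(generated_outputs, reference_outputs)

# MetricValue supports comparison operators, so the bigram-overlap scores
# can be used directly as a test assertion.
assert result > 0.1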

langcheck.metrics.ja.reference_based_text_quality.rougeL(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-L scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
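
Example (a minimal sketch; a custom Tokenizer can also be passed via the tokenizer keyword, but the default Japanese tokenizer is used here):

from langcheck.metrics.ja.reference_based_text_quality import rougeL

result = rougeL(
    generated_outputs=["東京は日本の首都です。"],
    reference_outputs=["日本の首都は東京です。"],
)
# The LCS-based F1 score(s), each in [0, 1]; identical texts score 1.0.
print(result.metric_values)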

langcheck.metrics.ja.reference_based_text_quality.semantic_similarity(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float][source]#

Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity.

We currently support two embedding model types:

1. The ‘local’ type, where the ‘paraphrase-multilingual-mpnet-base-v2’ model is downloaded from HuggingFace and run locally. This is the default model type, and no setup is needed to run it.

2. The EvalClient type, where you can use a similarity scorer returned by the given EvalClient. The scorer is typically implemented using the embedding APIs of cloud services. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/tasks/sentence-similarity
https://www.sbert.net/docs/usage/semantic_textual_similarity.html

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation). Defaults to ‘local’.

Returns:

A MetricValue object
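
Example (a minimal sketch using the default ‘local’ model type, which downloads ‘paraphrase-multilingual-mpnet-base-v2’ on first use; to use an embedding API instead, pass an EvalClient instance as eval_model):

from langcheck.metrics.ja.reference_based_text_quality import semantic_similarity

result = semantic_similarity(
    generated_outputs="犬が公園を走っています。",
    reference_outputs="犬が公園で走り回っています。",
)
# Cosine similarity of the sentence embeddings, typically between 0 and 1.
print(result.metric_values)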