langcheck.metrics.ja#

class langcheck.metrics.ja.JanomeTokenizer[source]#

Bases: _JapaneseTokenizer

Janome-based tokenizer for Japanese.

The default tokenizer for calculating ROUGE scores, based on Janome.

Note

The advantage of using Janome is that it is a pure Python library and introduces no additional dependencies. On the other hand, it takes more time to parse sentences than a MeCab-based tokenizer. In particular, constructing this class takes a few seconds because the Janome tokenizer loads its entire dictionary during initialization. If you are processing large amounts of data, consider setting up MeCab and using the MeCabTokenizer.

class langcheck.metrics.ja.MeCabTokenizer[source]#

Bases: _JapaneseTokenizer

An optional tokenizer for calculating ROUGE scores, based on MeCab.

Note

The advantage of using MeCab is that its core implementation is written in a compiled language and runs much faster than Janome. If you are processing large amounts of data, consider setting up MeCab and using the MeCabTokenizer. On the other hand, MeCab takes more effort to install and may not work in some environments. Please refer to the official page if the Python wrapper, mecab-python3, does not work in your environment.
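
A minimal sketch of passing a tokenizer to the ROUGE metrics below (assuming both tokenizer classes can be constructed without arguments; MeCabTokenizer additionally requires MeCab and mecab-python3 to be installed):

from langcheck.metrics.ja import JanomeTokenizer, MeCabTokenizer, rouge1

generated = ["東京は日本の首都です。"]
references = ["日本の首都は東京です。"]

# Default behavior: the pure-Python Janome tokenizer (slower startup, no extra dependencies)
print(rouge1(generated, references, tokenizer=JanomeTokenizer()))

# Faster alternative, if MeCab is set up in your environment
print(rouge1(generated, references, tokenizer=MeCabTokenizer()))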

langcheck.metrics.ja.answer_relevance(generated_outputs: List[str] | str, prompts: List[str] | str, eval_model: EvalClient) → MetricValue[float | None][source]#

Calculates the relevance of generated outputs to the prompt. This metric takes on float values of either 0.0 (Not Relevant), 0.5 (Partially Relevant), or 1.0 (Fully Relevant). The score may also be None if it could not be computed.

We currently only support the evaluation based on an EvalClient.
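
A minimal usage sketch, assuming an OpenAI-backed EvalClient is available as langcheck.metrics.eval_clients.OpenAIEvalClient (the exact client classes and constructor arguments depend on your installed version; see the concrete EvalClient classes):

from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumed client class
from langcheck.metrics.ja import answer_relevance

client = OpenAIEvalClient()  # assumption: reads OPENAI_API_KEY from the environment

result = answer_relevance(
    generated_outputs=["東京は日本の首都です。"],
    prompts=["日本の首都はどこですか？"],
    eval_model=client,
)
print(result)  # scores are 0.0, 0.5, 1.0, or None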

langcheck.metrics.ja.context_relevance(sources: List[str] | str, prompts: List[str] | str, eval_model: EvalClient) → MetricValue[float | None][source]#

Calculates the relevance of the sources to the prompts. This metric takes on float values between [0, 1], where 0 means that the source text is not at all relevant to the prompt, and 1 means that the source text is fully relevant to the prompt.

We currently only support the evaluation based on an EvalClient.

Parameters:
  • sources – The source text(s), one string per prompt

  • prompts – The prompt(s)

  • eval_model – The EvalClient instance used for the evaluation
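
A minimal usage sketch (the EvalClient construction below is an assumption; see the concrete EvalClient classes for the clients actually available):

from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumed client class
from langcheck.metrics.ja import context_relevance

client = OpenAIEvalClient()  # assumption: reads OPENAI_API_KEY from the environment

result = context_relevance(
    sources=["東京は日本の首都であり、最大の都市でもある。"],
    prompts=["日本の首都はどこですか？"],
    eval_model=client,
)
print(result)  # scores in [0, 1]; higher means the source is more relevant to the prompt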

langcheck.metrics.ja.factual_consistency(generated_outputs: List[str] | str, sources: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') → MetricValue[float | None][source]#

Calculates the factual consistency between the generated outputs and the sources. This metric takes on float values between [0, 1], where 0 means that the output is not at all consistent with the source text, and 1 means that the output is fully consistent with the source text. (NOTE: when using an EvalClient, the factuality scores are either 0.0, 0.5, or 1.0. The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where the ‘unieval-fact’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. This function wraps en_factual_consistency() using the translation model Helsinki-NLP/opus-mt-ja-en to translate the Japanese texts to English before computing the factual consistency scores. This is because the UniEval-fact model is trained on English text.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • sources – The source text(s), one string per generated output

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation), default ‘local’

Returns:

A MetricValue object
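
A minimal sketch with the default ‘local’ model (the first call downloads the UniEval-fact and translation models from HuggingFace; the example strings are illustrative):

from langcheck.metrics.ja import factual_consistency

result = factual_consistency(
    generated_outputs=["東京は日本の首都です。"],
    sources=["日本の首都は東京である。"],
)
print(result)  # scores in [0, 1]; higher means more consistent with the source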

langcheck.metrics.ja.fluency(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local', local_overflow_strategy: str = 'truncate') → MetricValue[float | None][source]#

Calculates the fluency scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low fluency and 1 is high fluency. (NOTE: when using an EvalClient, the fluency scores are either 0.0 (poor), 0.5 (fair), or 1.0 (good). The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where a model file is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. The model (liwii/fluency-score-classification-ja) is a fine-tuned model based on the line-corporation/line-distilbert-base-japanese model.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/line-corporation/line-distilbert-base-japanese
https://huggingface.co/liwii/fluency-score-classification-ja

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation), default ‘local’

  • local_overflow_strategy – The strategy for handling inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, outputs that are too long are assigned a score of None. If ‘truncate’, outputs that are too long are truncated. If ‘raise’, an error is raised when an output is too long. The default value is ‘truncate’.

Returns:

A MetricValue object
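
A minimal sketch with the default ‘local’ model (the example strings are illustrative):

from langcheck.metrics.ja import fluency

result = fluency(
    ["東京は日本の首都です。", "です首都本日東京は。"],
    local_overflow_strategy="truncate",
)
print(result)  # one score in [0, 1] per output (None if a score could not be computed)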

langcheck.metrics.ja.pairwise_comparison(generated_outputs_a: List[str] | str, generated_outputs_b: List[str] | str, prompts: List[str] | str, sources_a: List[str] | str | None = None, sources_b: List[str] | str | None = None, reference_outputs: List[str] | str | None = None, enforce_consistency: bool = True, eval_model: EvalClient | None = None) → MetricValue[float | None][source]#

Calculates the pairwise comparison metric. This metric takes on float values of either 0.0 (Response A is better), 0.5 (Tie), or 1.0 (Response B is better). The score may also be None if it could not be computed.

We currently only support the evaluation based on an EvalClient.

Parameters:
  • generated_outputs_a – Model A’s generated output(s) to evaluate

  • generated_outputs_b – Model B’s generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s)

  • sources_a – The source text(s) for Model A’s generated output(s), default None

  • sources_b – The source text(s) for Model B’s generated output(s), default None

  • reference_outputs – The reference output(s), default None

  • enforce_consistency – When this is True, we will only return a score if the score is the same when Model A and Model B are swapped. This is useful for ensuring that the evaluator’s position bias is not impacting the scores. Default True.

  • eval_model – The EvalClient instance used for the evaluation. This is marked as Optional so that it can follow the above arguments that have default values (for consistency with the other metrics), but this is in fact a required argument.

Returns:

A MetricValue object
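
A minimal usage sketch (the EvalClient construction below is an assumption; see the concrete EvalClient classes for the clients actually available):

from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumed client class
from langcheck.metrics.ja import pairwise_comparison

client = OpenAIEvalClient()  # assumption: reads OPENAI_API_KEY from the environment

result = pairwise_comparison(
    generated_outputs_a=["東京です。"],
    generated_outputs_b=["日本の首都は東京です。"],
    prompts=["日本の首都はどこですか？"],
    eval_model=client,
)
print(result)  # 0.0 = A is better, 0.5 = tie, 1.0 = B is better, None = not computed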

langcheck.metrics.ja.rouge1(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) → MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
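
A minimal sketch; when no tokenizer is given, the Janome-based default tokenizer is used (the example strings are illustrative):

from langcheck.metrics.ja import rouge1

result = rouge1(
    generated_outputs=["東京は日本の首都です。"],
    reference_outputs=["日本の首都は東京です。"],
)
print(result)  # F1 of unigram overlap, in [0, 1]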

langcheck.metrics.ja.rouge2(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) → MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-2 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.ja.rougeL(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) → MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-L scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object

langcheck.metrics.ja.semantic_similarity(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') → MetricValue[float][source]#

Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity.

We currently support two embedding model types:

1. The ‘local’ type, where the ‘paraphrase-multilingual-mpnet-base-v2’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The EvalClient type, where you can use a similarity scorer returned by the given EvalClient. The scorer is typically implemented using the embedding APIs of cloud services. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/tasks/sentence-similarity
https://www.sbert.net/docs/usage/semantic_textual_similarity.html

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation), default ‘local’

Returns:

A MetricValue object
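
A minimal sketch with the default ‘local’ embedding model (the example strings are illustrative):

from langcheck.metrics.ja import semantic_similarity

result = semantic_similarity(
    generated_outputs=["東京は日本の首都です。"],
    reference_outputs=["日本の首都は東京です。"],
)
print(result)  # cosine similarity, typically between 0 and 1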

langcheck.metrics.ja.sentiment(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local', local_overflow_strategy: str = 'truncate') → MetricValue[float | None][source]#

Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using an EvalClient, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive). The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where the twitter-xlm-roberta-base-sentiment-multilingual model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation), default ‘local’

  • local_overflow_strategy – The strategy for handling inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, outputs that are too long are assigned a score of None. If ‘truncate’, outputs that are too long are truncated. If ‘raise’, an error is raised when an output is too long. The default value is ‘truncate’.

Returns:

A MetricValue object
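
A minimal sketch with the default ‘local’ model (the example strings are illustrative):

from langcheck.metrics.ja import sentiment

result = sentiment(["このライブラリはとても使いやすいです。", "最悪の体験でした。"])
print(result)  # 0 = negative sentiment, 1 = positive sentiment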

langcheck.metrics.ja.tateishi_ono_yamada_reading_ease(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float][source]#

Calculates the readability of generated Japanese outputs using the reading ease score introduced in “日本文の読みやすさの評価式 (A Computer Readability Formula of Japanese Texts for Machine Scoring)”. This metric takes on float values in (-∞, ∞). In the paper, the average and standard deviation of the scores for the 77 texts used in the experiment are reported to be 50 and 10, respectively. Higher scores mean the text is easier to read.

The score is based on the number of “runs”, which are sequences of characters of the same type (hiragana, katakana, kanji, etc.). See the original paper for details.

Ref:

https://www.jstage.jst.go.jp/article/nihongokyoiku/158/0/158_49/_pdf/-char/ja (Japanese)
https://ipsj.ixsq.nii.ac.jp/ej/?action=pages_view_main&active_action=repository_view_main_item_detail&item_id=37773&item_no=1&page_id=13&block_id=8 (Japanese)
https://aclanthology.org/C88-2135/ (English)

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
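
A minimal sketch (the example text is illustrative):

from langcheck.metrics.ja import tateishi_ono_yamada_reading_ease

result = tateishi_ono_yamada_reading_ease(["吾輩は猫である。名前はまだ無い。"])
print(result)  # higher scores mean the text is easier to read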

langcheck.metrics.ja.toxicity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local', local_overflow_strategy: str = 'truncate') → MetricValue[float | None][source]#

Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity. (NOTE: when using an EvalClient, the toxicity scores are in steps of 0.25. The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where a model file is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. The model (Alnusjaponica/toxicity-score-multi-classification) is a fine-tuned model based on the line-corporation/line-distilbert-base-japanese model.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/line-corporation/line-distilbert-base-japanese
https://huggingface.co/Alnusjaponica/toxicity-score-multi-classification

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or the EvalClient instance used for the evaluation), default ‘local’

  • local_overflow_strategy – The strategy for handling inputs that are too long for the local model. The supported strategies are ‘nullify’, ‘truncate’, and ‘raise’. If ‘nullify’, outputs that are too long are assigned a score of None. If ‘truncate’, outputs that are too long are truncated. If ‘raise’, an error is raised when an output is too long. The default value is ‘truncate’.

Returns:

A MetricValue object
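
A minimal sketch with the default ‘local’ model (the example string is illustrative):

from langcheck.metrics.ja import toxicity

result = toxicity(["ご意見ありがとうございます。"])
print(result)  # 0 = low toxicity, 1 = high toxicity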