langcheck.metrics.ja
- class langcheck.metrics.ja.JanomeTokenizer
Bases: _JapaneseTokenizer
Janome-based tokenizer for Japanese.
The default tokenizer used to calculate ROUGE scores, based on Janome.
Note
The advantage of using Janome is that it is a pure Python library and introduces no additional dependencies. On the other hand, it takes more time to parse sentences than a MeCab-based tokenizer. In particular, constructing this class takes seconds every time, because the Janome tokenizer loads the entire dictionary during initialization. If you are processing large amounts of data, consider setting up MeCab and using the MeCabTokenizer.
- class langcheck.metrics.ja.MeCabTokenizer
Bases: _JapaneseTokenizer
An optional tokenizer used to calculate ROUGE scores, based on MeCab.
Note
The advantage of using MeCab is that the core implementation is written in a compiled language and runs much faster than Janome. If you are processing large amounts of data, consider setting up MeCab and using the MeCabTokenizer. On the other hand, MeCab takes more effort to install in some environments and may not work in all of them. Please refer to the official page if the Python wrapper, mecab-python3, does not work in your environment.
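The trade-off between the two tokenizers can be handled with a simple availability check. The following is a minimal sketch, not part of langcheck: the helper name `preferred_tokenizer_name` is hypothetical, and it only checks whether the `MeCab` module installed by mecab-python3 can be found, not whether a MeCab dictionary is correctly set up.

```python
import importlib.util


def preferred_tokenizer_name() -> str:
    """Return the name of the tokenizer class to prefer in this environment.

    Hypothetical helper: it only inspects whether the modules are installed.
    """
    # mecab-python3 installs an importable top-level module named "MeCab".
    if importlib.util.find_spec("MeCab") is not None:
        return "MeCabTokenizer"
    # Janome is pure Python, so it serves as the fallback.
    return "JanomeTokenizer"


print(preferred_tokenizer_name())
```

Whichever tokenizer is chosen, constructing it once and reusing the instance avoids paying the dictionary loading cost repeatedly.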
- langcheck.metrics.ja.factual_consistency(generated_outputs: List[str] | str, sources: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]
Calculates the factual consistency between the generated outputs and the sources. The factual consistency score for one generated output is computed as the average of the per-sentence consistencies of the generated output with the source text. This metric takes on float values in the range [0, 1], where 0 means that the output is not at all consistent with the source text, and 1 means that the output is fully consistent with the source text. (NOTE: when using the OpenAI model, the factuality score for each sentence is either 0.0, 0.5, or 1.0.)
We currently support two model types:
1. The ‘local’ type, where the ‘unieval-fact’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. This function wraps en_factual_consistency() using the translation model staka/fugumt-ja-en to translate the Japanese texts to English before computing the factual consistency scores. This is because the UniEval-fact model is trained on English text.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
sources – The source text(s), one string per generated output
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’
openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None
- Returns:
A MetricValue object
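The aggregation described above (one score per sentence, averaged per output) can be sketched as follows. This is an illustration only: the function name `average_sentence_consistency` is hypothetical, and the per-sentence scores are assumed to have been produced already by whichever model type is in use.

```python
from statistics import fmean
from typing import Sequence


def average_sentence_consistency(sentence_scores: Sequence[float]) -> float:
    """Average per-sentence consistency scores into one score for an output."""
    if not sentence_scores:
        raise ValueError("at least one sentence score is required")
    if any(not 0.0 <= score <= 1.0 for score in sentence_scores):
        raise ValueError("each sentence score must lie in [0, 1]")
    return fmean(sentence_scores)


# Three sentences scored 1.0, 0.5, and 0.0, the three levels that the
# OpenAI model type can return for a sentence.
print(average_sentence_consistency([1.0, 0.5, 0.0]))
```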
- langcheck.metrics.ja.fluency(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]
Calculates the fluency scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low fluency and 1 is high fluency.
We currently support two model types:
1. The ‘local’ type, where a model file is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. The model (liwii/fluency-score-classification-ja) is fine-tuned from the line-corporation/line-distilbert-base-japanese model.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default, in the same way as the English counterpart. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.
- Ref:
https://huggingface.co/line-corporation/line-distilbert-base-japanese
https://huggingface.co/liwii/fluency-score-classification-ja
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’
openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None
- Returns:
A MetricValue object
- langcheck.metrics.ja.rouge1(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) → MetricValue[float]
Calculates the F1 metrics of the ROUGE-1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
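prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
tokenizer – The tokenizer used to split the text into tokens, default None (JanomeTokenizer is the default tokenizer)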
- Returns:
A MetricValue object
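To make the unigram overlap concrete, here is a minimal pure-Python sketch of a ROUGE-1 F1 computation. It is not langcheck's implementation: the function name `rouge1_f1` is hypothetical, the inputs are assumed to be already tokenized (standing in for the output of a Japanese tokenizer), and overlap is counted with clipped token counts as in the usual ROUGE-N definition.

```python
from collections import Counter
from typing import Sequence


def rouge1_f1(generated: Sequence[str], reference: Sequence[str]) -> float:
    """F1 of the unigram overlap between two token sequences."""
    gen_counts = Counter(generated)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Three of the four tokens are shared, so precision and recall are both 3/4.
print(rouge1_f1(["猫", "が", "好き", "です"], ["犬", "が", "好き", "です"]))
```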
- langcheck.metrics.ja.rouge2(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) → MetricValue[float]
Calculates the F1 metrics of the ROUGE-2 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
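prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
tokenizer – The tokenizer used to split the text into tokens, default None (JanomeTokenizer is the default tokenizer)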
- Returns:
A MetricValue object
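The bigram overlap can be sketched in the same spirit. This is an illustration under assumptions, not langcheck's code: `bigrams` and `rouge2_f1` are hypothetical names, the inputs are assumed to be pre-tokenized, and clipped counts are used for the overlap.

```python
from collections import Counter
from typing import Sequence


def bigrams(tokens: Sequence[str]) -> Counter:
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(generated: Sequence[str], reference: Sequence[str]) -> float:
    """F1 of the bigram overlap between two token sequences."""
    gen_counts = bigrams(generated)
    ref_counts = bigrams(reference)
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of the three bigrams on each side are shared.
print(rouge2_f1(["猫", "が", "好き", "です"], ["犬", "が", "好き", "です"]))
```

Note that a single differing token removes two bigrams at once, which is why ROUGE-2 tends to be stricter than ROUGE-1 on the same pair of texts.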
- langcheck.metrics.ja.rougeL(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) → MetricValue[float]
Calculates the F1 metrics of the ROUGE-L scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
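prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
tokenizer – The tokenizer used to split the text into tokens, default None (JanomeTokenizer is the default tokenizer)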
- Returns:
A MetricValue object
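The longest common subsequence underlying ROUGE-L can be computed with a standard dynamic program. The sketch below is illustrative rather than langcheck's implementation: `lcs_length` and `rouge_l_f1` are hypothetical names, the inputs are assumed to be pre-tokenized, and precision and recall are taken as the LCS length divided by the generated and reference lengths respectively.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one row of memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated: Sequence[str], reference: Sequence[str]) -> float:
    """F1 based on the LCS of two token sequences."""
    lcs = lcs_length(generated, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(generated)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# The LCS is 私 は が 好き (4 tokens) out of 5 tokens on each side.
print(rouge_l_f1(["私", "は", "猫", "が", "好き"], ["私", "は", "犬", "が", "好き"]))
```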
- langcheck.metrics.ja.semantic_similarity(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, embedding_model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]
Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity. (NOTE: when using OpenAI embeddings, the cosine similarities tend to be skewed quite heavily towards higher numbers.)
We currently support two embedding model types:
1. The ‘local’ type, where the ‘paraphrase-multilingual-mpnet-base-v2’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘text-embedding-ada-002’ model by default (this is configurable). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.
- Ref:
https://huggingface.co/tasks/sentence-similarity
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
https://openai.com/blog/new-and-improved-embedding-model
- Parameters:
generated_outputs – The model generated output(s) to evaluate
reference_outputs – The reference output(s)
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
embedding_model_type – The type of embedding model to use (‘local’ or ‘openai’), default ‘local’
openai_args – Dict of additional args to pass in to the openai.Embedding.create function, default None
- Returns:
A MetricValue object
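The cosine similarity that this metric reports can be illustrated on small vectors. This is a minimal sketch with a hypothetical function name `cosine_similarity`; the toy two-dimensional vectors stand in for real embeddings, which have many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The three cases correspond to the ends and midpoint of the [-1, 1] range mentioned above.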
- langcheck.metrics.ja.sentiment(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]
Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using the OpenAI model, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive).)
We currently support two model types:
1. The ‘local’ type, where the Twitter-roBERTa-base-sentiment-multilingual model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’
openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None
- Returns:
A MetricValue object
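The three discrete values returned with the OpenAI model type amount to a fixed mapping from a sentiment category to a score. The sketch below only illustrates that mapping: the label strings and the name `sentiment_label_to_score` are assumptions made for this example, and the exact strings exchanged with the OpenAI model are not specified here.

```python
# Assumed label strings; the score values come from the note above.
SENTIMENT_SCORES = {"negative": 0.0, "neutral": 0.5, "positive": 1.0}


def sentiment_label_to_score(label: str) -> float:
    """Map a categorical sentiment label to its score in [0, 1]."""
    key = label.strip().lower()
    if key not in SENTIMENT_SCORES:
        raise ValueError(f"unknown sentiment label: {label!r}")
    return SENTIMENT_SCORES[key]


print([sentiment_label_to_score(label) for label in ("Negative", "Neutral", "Positive")])
```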
- langcheck.metrics.ja.tateishi_ono_yamada_reading_ease(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) → MetricValue[float]
Calculates the readability of generated Japanese outputs using the reading ease score introduced in “日本文の読みやすさの評価式 (A Computer Readability Formula of Japanese Texts for Machine Scoring)”. This metric takes on float values in (-∞, ∞), but the paper reports that the mean and standard deviation of the scores for the 77 texts used in the experiment are 50 and 10, respectively. Higher scores mean the text is easier to read.
The score is based on the number of “runs”, which are sequences of characters of the same type (hiragana, katakana, kanji, etc.). See the original paper for details.
- Ref:
https://www.jstage.jst.go.jp/article/nihongokyoiku/158/0/158_49/_pdf/-char/ja (Japanese)
https://ipsj.ixsq.nii.ac.jp/ej/?action=pages_view_main&active_action=repository_view_main_item_detail&item_id=37773&item_no=1&page_id=13&block_id=8 (Japanese)
https://aclanthology.org/C88-2135/ (English)
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object
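The notion of a "run" can be illustrated by grouping consecutive characters of the same type. This sketch covers only the run-splitting step, not the reading ease formula itself (its coefficients are in the paper). The names `char_type` and `split_runs` are hypothetical, and the Unicode ranges used to classify characters are a simplifying assumption that may differ from the library's own classification.

```python
from itertools import groupby
from typing import List, Tuple


def char_type(ch: str) -> str:
    """Classify one character by script, using basic Unicode block ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long vowel mark ー
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def split_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into maximal runs of characters with the same type."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=char_type)]


for kind, run in split_runs("私はコーヒーが好き"):
    print(kind, run)
```

For this sentence the sketch yields six runs: 私, は, コーヒー, が, 好, き.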
- langcheck.metrics.ja.toxicity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, model_type: str = 'local', openai_args: Dict[str, str] | None = None) → MetricValue[float]
Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity.
We currently support two model types:
1. The ‘local’ type, where a model file is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. The model (Alnusjaponica/toxicity-score-multi-classification) is fine-tuned from the line-corporation/line-distilbert-base-japanese model.
2. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default, in the same way as the English counterpart. While the model you use is configurable, please make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See https://langcheck.readthedocs.io/en/latest/metrics.html#computing-metrics-with-openai-models for examples on setting up the OpenAI API key.
- Ref:
https://huggingface.co/line-corporation/line-distilbert-base-japanese
https://huggingface.co/Alnusjaponica/toxicity-score-multi-classification
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
model_type – The type of model to use (‘local’ or ‘openai’), default ‘local’
openai_args – Dict of additional args to pass in to the openai.ChatCompletion.create function, default None
- Returns:
A MetricValue object