langcheck.metrics.zh#

class langcheck.metrics.zh.HanLPTokenizer[source]#

Bases: _ChineseTokenizer

HanLP-based tokenizer for Chinese.

The default tokenizer used to calculate ROUGE scores for Chinese text.

Note

HanLP is an actively maintained NLP library that was initially developed for Chinese language processing. We run HanLP’s single-task models in HanLP’s pipeline mode, because:

1. HanLP offers both multi-task and single-task models. The multi-task models are quite large (generally 400MB+), whereas the single-task models are only ~40MB, so we use a single-task model by default.

2. HanLP’s pipeline mode processes long texts (i.e. many sentences) efficiently in parallel: it splits the text into sentences and applies the tokenizer to those sentences in parallel.
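
For illustration, here is a minimal usage sketch of passing this tokenizer explicitly to the Chinese ROUGE metrics. It assumes langcheck and the HanLP dependency are installed, that HanLPTokenizer can be constructed without arguments (it is the default tokenizer for these metrics), and that scores can be read from the metric_values attribute of the returned MetricValue.

from langcheck.metrics.zh import HanLPTokenizer, rouge1

generated = ["北京是中国的首都。"]
references = ["中国的首都是北京。"]

# Pass the HanLP-based tokenizer explicitly; since it is the default for the
# Chinese ROUGE metrics, this should be equivalent to omitting the argument.
tokenizer = HanLPTokenizer()
result = rouge1(generated, references, tokenizer=tokenizer)
print(result.metric_values)  # unigram-overlap F1, between 0 and 1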

langcheck.metrics.zh.factual_consistency(generated_outputs: List[str] | str, sources: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float | None][source]#

Calculates the factual consistency between the generated outputs and the sources. This metric takes on float values between [0, 1], where 0 means that the output is not at all consistent with the source text, and 1 means that the output is fully consistent with the source text. (NOTE: when using an EvalClient, the factuality scores are either 0.0, 0.5, or 1.0. The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where the ‘unieval-fact’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. This function wraps en_factual_consistency() using the translation model Helsinki-NLP/opus-mt-zh-en to translate the Chinese texts to English before computing the factual consistency scores. This is because the UniEval-fact model is trained on English text.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • sources – The source text(s), one string per generated output

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or an EvalClient instance used for the evaluation). Defaults to ‘local’.

Returns:

A MetricValue object
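
A usage sketch with the default ‘local’ evaluation model follows. The exact scores depend on the downloaded translation and UniEval-fact models, and the threshold comparison assumes MetricValue’s comparison operators; treat this as a sketch rather than guaranteed output.

from langcheck.metrics.zh import factual_consistency

generated_outputs = ["东京是日本的首都。"]
sources = ["日本的首都是东京，位于关东地区。"]

# Default 'local' setup: the Chinese texts are translated to English with
# Helsinki-NLP/opus-mt-zh-en, then scored by the unieval-fact model.
result = factual_consistency(generated_outputs, sources)
print(result.metric_values)  # one score per output, in [0, 1]

# MetricValue supports threshold comparisons (truthy only if every score
# passes), which is convenient in tests.
print(bool(result > 0.5))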

langcheck.metrics.zh.rouge1(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-1 scores between the generated outputs and the reference outputs. It evaluates the overlap of unigrams (single tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
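
A minimal sketch of scoring a single output against a reference (same assumptions as the sketches above):

from langcheck.metrics.zh import rouge1

generated = ["我喜欢喝咖啡。"]
references = ["我喜欢喝茶。"]

# The HanLP-based tokenizer is used by default; a custom Tokenizer can be
# supplied via the keyword-only `tokenizer` argument.
result = rouge1(generated, references)
print(result.metric_values)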

langcheck.metrics.zh.rouge2(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-2 scores between the generated outputs and the reference outputs. It evaluates the overlap of bigrams (two adjacent tokens) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 is no overlap and 1 is complete overlap.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
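
A sketch contrasting ROUGE-1 with ROUGE-2 on the same pair, since bigram overlap is usually the stricter of the two (same assumptions as above):

from langcheck.metrics.zh import rouge1, rouge2

generated = ["我今天喝了一杯咖啡。"]
references = ["我今天喝了一杯绿茶。"]

# ROUGE-2 counts overlapping bigrams, so partially matching outputs tend to
# score lower than under ROUGE-1.
print(rouge1(generated, references).metric_values)
print(rouge2(generated, references).metric_values)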

langcheck.metrics.zh.rougeL(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, *, tokenizer: Tokenizer | None = None) MetricValue[float][source]#

Calculates the F1 metrics of the ROUGE-L scores between the generated outputs and the reference outputs. It evaluates the longest common subsequence (LCS) between the generated outputs and the reference outputs. This metric takes on float values between [0, 1], where 0 means that the LCS is empty and 1 means that the reference and generated outputs are the same.

Ref:

google-research/google-research

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object
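
A sketch of ROUGE-L, which rewards long in-order token matches even when they are not contiguous (same assumptions as above):

from langcheck.metrics.zh import rougeL

generated = ["他昨天去了北京参加会议。"]
references = ["他昨天去北京开会了。"]

# ROUGE-L is based on the longest common subsequence of tokens, so word order
# matters but matching tokens need not be adjacent.
result = rougeL(generated, references)
print(result.metric_values)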

langcheck.metrics.zh.semantic_similarity(generated_outputs: List[str] | str, reference_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float][source]#

Calculates the semantic similarities between the generated outputs and the reference outputs. The similarities are computed as the cosine similarities between the generated and reference embeddings. This metric takes on float values between [-1, 1], but typically ranges between 0 and 1 where 0 is minimum similarity and 1 is maximum similarity. (NOTE: when using OpenAI embeddings, the cosine similarities tend to be skewed quite heavily towards higher numbers.)

We currently support two embedding model types:

1. The ‘local’ type, where the ‘BAAI/bge-base-zh-v1.5’ model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. This model returns cosine similarities of around 0.3 when sentences have no semantic similarity. Sentences with missing or different punctuation (e.g. one is a declarative sentence and the other is a question) can lower the value to around 0.2 ~ 0.3.

2. The EvalClient type, where you can use a similarity scorer returned by the given EvalClient. The scorer is typically implemented using the embedding APIs of cloud services. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/tasks/sentence-similarity
https://www.sbert.net/docs/usage/semantic_textual_similarity.html

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • reference_outputs – The reference output(s)

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or an EvalClient instance used for the evaluation). Defaults to ‘local’.

Returns:

A MetricValue object
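
A sketch using the default ‘local’ embedding model; the exact value depends on the downloaded BAAI/bge-base-zh-v1.5 model (same assumptions as above):

from langcheck.metrics.zh import semantic_similarity

generated = ["今天的天气非常好。"]
references = ["今天天气很好。"]

# Cosine similarity between the two sentence embeddings.
result = semantic_similarity(generated, references)
print(result.metric_values)  # typically between 0 and 1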

langcheck.metrics.zh.sentiment(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float | None][source]#

Calculates the sentiment scores of generated outputs. This metric takes on float values between [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using an EvalClient, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive). The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where the IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or an EvalClient instance used for the evaluation). Defaults to ‘local’.

Returns:

A MetricValue object
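
A sketch with the default ‘local’ sentiment model. The commented-out line shows where an EvalClient instance would be passed instead; my_eval_client is a hypothetical placeholder, not a name provided by the library.

from langcheck.metrics.zh import sentiment

generated_outputs = [
    "这个产品太棒了，我非常满意！",
    "质量很差，完全不推荐。",
]

# Default 'local' model: IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment.
result = sentiment(generated_outputs)
print(result.metric_values)  # closer to 1 = positive, closer to 0 = negative

# To score with an LLM instead, pass an EvalClient as eval_model, e.g.:
# result = sentiment(generated_outputs, eval_model=my_eval_client)  # hypothetical client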

langcheck.metrics.zh.toxicity(generated_outputs: List[str] | str, prompts: List[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float | None][source]#

Calculates the toxicity scores of generated outputs. This metric takes on float values between [0, 1], where 0 is low toxicity and 1 is high toxicity. (NOTE: when using an EvalClient, the toxicity scores are in steps of 0.25. The score may also be None if it could not be computed.)

We currently support two evaluation model types:

1. The ‘local’ type, where a model file is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this. The model (alibaba-pai/pai-bert-base-zh-llm-risk-detection) is a risk-detection model for LLM-generated content released by the Alibaba Group.

2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.

Ref:

https://huggingface.co/alibaba-pai/pai-bert-base-zh-llm-risk-detection

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

  • eval_model – The type of model to use (‘local’ or an EvalClient instance used for the evaluation). Defaults to ‘local’.

Returns:

A MetricValue object
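
A sketch with the default ‘local’ risk-detection model (same assumptions as above):

from langcheck.metrics.zh import toxicity

generated_outputs = ["感谢您的反馈，我们会尽快处理您的问题。"]

# Default 'local' model: alibaba-pai/pai-bert-base-zh-llm-risk-detection.
result = toxicity(generated_outputs)
print(result.metric_values)  # 0 = low toxicity, 1 = high toxicity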

langcheck.metrics.zh.xuyaochen_report_readability(generated_outputs: List[str] | str, prompts: List[str] | str | None = None) MetricValue[float][source]#

Calculates the readability scores of generated outputs as introduced in “中文年报可读性” (Chinese annual report readability). This metric computes the average number of words per sentence as r1, and the average number of adverbs and coordinating conjunctions per sentence as r2; following the Fog Index, the final score is the arithmetic mean of r1 and r2. This function uses the HanLP tokenizer together with its POS tagger, with POS tags in CTB style (https://hanlp.hankcs.com/docs/annotations/pos/ctb.html). The lower the score, the better the readability. The score is mainly influenced by r1, the average number of words per sentence.

Ref:

Chinese annual report readability: measurement and test
https://www.tandfonline.com/doi/full/10.1080/21697213.2019.1701259

Parameters:
  • generated_outputs – The model generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.

Returns:

A MetricValue object containing the readability scores
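
A sketch of scoring report-style sentences; lower scores indicate better readability (same assumptions as above):

from langcheck.metrics.zh import xuyaochen_report_readability

generated_outputs = [
    "本公司报告期内实现营业收入十亿元，同比增长百分之十。",
    "经营情况良好。",
]

result = xuyaochen_report_readability(generated_outputs)
print(result.metric_values)  # lower = more readable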