langcheck.metrics.zh.reference_free_text_quality#
- langcheck.metrics.zh.reference_free_text_quality.sentiment(generated_outputs: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local') MetricValue[float | None] [source]#
Calculates the sentiment scores of generated outputs. This metric takes on float values in the range [0, 1], where 0 is negative sentiment and 1 is positive sentiment. (NOTE: when using an EvalClient, the sentiment scores are either 0.0 (negative), 0.5 (neutral), or 1.0 (positive). The score may also be None if it could not be computed.)
We currently support two evaluation model types:
1. The ‘local’ type, where the IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment model is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run this.
2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or an EvalClient instance used for the evaluation). Defaults to ‘local’.
- Returns:
A MetricValue object
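A minimal usage sketch of the sentiment metric with the default local model (the import path follows the module documented above; the EvalClient usage noted at the end is only a pointer, not a concrete setup):

    from langcheck.metrics.zh.reference_free_text_quality import sentiment

    generated_outputs = [
        "这家餐厅的服务很好,菜也很好吃。",      # positive review
        "等了一个小时,菜还是凉的,太失望了。",  # negative review
    ]

    # With the default eval_model='local', the
    # IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment model is downloaded from
    # HuggingFace and run locally; no extra setup is needed.
    result = sentiment(generated_outputs)

    # result is a MetricValue whose scores are floats in [0, 1];
    # higher means more positive sentiment.
    print(result)

To use an LLM-based evaluator instead, pass an EvalClient instance as eval_model; setup details are described in the documentation of each concrete EvalClient class.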
- langcheck.metrics.zh.reference_free_text_quality.toxicity(generated_outputs: list[str] | str, prompts: list[str] | str | None = None, eval_model: str | EvalClient = 'local', eval_prompt_version: str = 'v2') MetricValue[float | None] [source]#
Calculates the toxicity scores of generated outputs. This metric takes on float values in the range [0, 1], where 0 is low toxicity and 1 is high toxicity. (NOTE: when using an EvalClient, the toxicity scores are in steps of 0.25. The score may also be None if it could not be computed.)
We currently support two evaluation model types:
1. The ‘local’ type, where a model file is downloaded from HuggingFace and run locally. This is the default model type and there is no setup needed to run it. The model (alibaba-pai/pai-bert-base-zh-llm-risk-detection) is a risk detection model for LLM-generated content released by the Alibaba Group.
2. The EvalClient type, where you can use an EvalClient typically implemented with an LLM. The implementation details are explained in each of the concrete EvalClient classes.
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
eval_model – The type of model to use (‘local’ or an EvalClient instance used for the evaluation). Defaults to ‘local’.
eval_prompt_version – The version of the eval prompt to use when the EvalClient is used. The default version is ‘v2’ (latest).
- Returns:
A MetricValue object
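A minimal usage sketch of the toxicity metric, again with the default local model; passing a single string instead of a list follows the documented signature:

    from langcheck.metrics.zh.reference_free_text_quality import toxicity

    # A single string is also accepted in place of a list.
    generated_output = "你这种人真是又蠢又没用。"

    # With eval_model='local' (the default), the
    # alibaba-pai/pai-bert-base-zh-llm-risk-detection model is downloaded
    # from HuggingFace and run locally.
    result = toxicity(generated_output)

    # Scores are floats in [0, 1]; higher means more toxic. When an
    # EvalClient is used instead, scores come in steps of 0.25 and
    # eval_prompt_version (default 'v2') selects the eval prompt.
    print(result)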
- langcheck.metrics.zh.reference_free_text_quality.xuyaochen_report_readability(generated_outputs: list[str] | str, prompts: list[str] | str | None = None) MetricValue[float] [source]#
Calculates the readability scores of generated outputs, as introduced in “中文年报可读性” (Chinese annual report readability). The metric computes the average number of words per sentence (r1) and the average number of adverbs and coordinating conjunctions per sentence (r2) for the given generated outputs, then, following the Fog Index, combines r1 and r2 by their arithmetic mean to produce the final score. This function uses the HanLP tokenizer together with its POS tagger; POS tags follow the CTB style (https://hanlp.hankcs.com/docs/annotations/pos/ctb.html). The lower the score, the better the readability. The score is mainly influenced by r1, the average number of words per sentence.
- Ref:
Chinese annual report readability: measurement and test. https://www.tandfonline.com/doi/full/10.1080/21697213.2019.1701259
- Parameters:
generated_outputs – The model generated output(s) to evaluate
prompts – The prompts used to generate the output(s). Prompts are optional metadata and not used to calculate the metric.
- Returns:
A MetricValue object containing the readability scores
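A minimal usage sketch of the readability metric; per the description above, the final score is the arithmetic mean of r1 (average words per sentence) and r2 (average adverbs plus coordinating conjunctions per sentence), and the metric relies on HanLP for tokenization and POS tagging:

    from langcheck.metrics.zh.reference_free_text_quality import (
        xuyaochen_report_readability,
    )

    reports = [
        "本年度公司实现营业收入十亿元。净利润同比增长百分之十。",
        "公司董事会持续并且积极地推进治理工作,同时不断地完善内部控制体系,从而有效地防范经营风险。",
    ]

    result = xuyaochen_report_readability(reports)

    # Lower scores indicate better readability; the value is driven mainly
    # by r1, the average number of words per sentence.
    print(result)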