langcheck.metrics.en.query_based_text_quality

langcheck.metrics.en.query_based_text_quality.adult_content(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates whether adult content is included in the generated outputs to the prompt. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.
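
For illustration, a minimal usage sketch. It assumes an OpenAI-backed client (OpenAIEvalClient from langcheck.metrics.eval_clients, which reads OPENAI_API_KEY from the environment); any other EvalClient implementation can be passed as eval_model in the same way:

    from langcheck.metrics.en.query_based_text_quality import adult_content
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    # Any EvalClient works here; OpenAIEvalClient uses the OPENAI_API_KEY
    # environment variable by default.
    client = OpenAIEvalClient()

    result = adult_content(
        generated_outputs="Here is a family-friendly bedtime story about a brave turtle...",
        prompts="Tell me a bedtime story.",
        eval_model=client,
    )
    print(result.metric_values)  # e.g. [0.0] (Low Risk), or [None] if the score could not be computed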

langcheck.metrics.en.query_based_text_quality.answer_relevance(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates the relevance of generated outputs to the prompt. This metric takes on float values of either 0.0 (Not Relevant), 0.5 (Partially Relevant), or 1.0 (Fully Relevant). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.
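
Both generated_outputs and prompts also accept lists, so a batch can be scored in one call. A sketch under the same assumptions as the example above:

    from langcheck.metrics.en.query_based_text_quality import answer_relevance
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    client = OpenAIEvalClient()

    prompts = [
        "What is the capital of France?",
        "How do I reset my password?",
    ]
    generated_outputs = [
        "The capital of France is Paris.",
        "I like turtles.",
    ]

    relevance = answer_relevance(generated_outputs, prompts, eval_model=client)
    print(relevance.metric_values)  # e.g. [1.0, 0.0]

    # In a test suite, the whole batch can be checked against a threshold using
    # the general MetricValue comparison support, e.g.:
    #     assert relevance >= 0.5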

langcheck.metrics.en.query_based_text_quality.answer_safety(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates the safety of generated outputs to the prompt. This metric takes on float values of either 0.0 (Unsafe), 0.5 (Partially Unsafe), or 1.0 (Safe). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

langcheck.metrics.en.query_based_text_quality.harmful_activity(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates whether the generated outputs to the prompt include suggestions or instructions for harmful activities. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

langcheck.metrics.en.query_based_text_quality.hate_speech(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates whether hate speech is included in the generated outputs to the prompt. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

langcheck.metrics.en.query_based_text_quality.personal_data_leakage(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates the risk that the generated outputs to the prompt leak personal data. This metric takes on float values of either 0.0 (Low Risk), 0.5 (Medium Risk), or 1.0 (High Risk). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

langcheck.metrics.en.query_based_text_quality.summarization_quality(generated_outputs: list[str] | str, prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates the summarization quality of the generated outputs. This metric takes on float values of either 0.0 (Not Good), 0.5 (Somewhat Good), or 1.0 (Good). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

Parameters:
  • generated_outputs – The model-generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s)

  • eval_model – The EvalClient instance used for the evaluation

Returns:

A MetricValue object
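
A sketch of calling this metric and inspecting the returned MetricValue, under the same EvalClient assumptions as the earlier examples (the explanations field is assumed to be filled in by EvalClient-based metrics):

    from langcheck.metrics.en.query_based_text_quality import summarization_quality
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    client = OpenAIEvalClient()

    prompt = (
        "Summarize the following announcement in one sentence:\n"
        "LangCheck provides metrics for evaluating the text quality of LLM applications..."
    )
    summary = "LangCheck is a library of text-quality metrics for LLM applications."

    result = summarization_quality(summary, prompt, eval_model=client)
    print(result.metric_values)  # e.g. [1.0] (Good)
    print(result.explanations)   # the eval model's reasoning, when available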

langcheck.metrics.en.query_based_text_quality.system_prompt_adherence(generated_outputs: list[str] | str, prompts: list[str] | str, system_prompts: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates how well the system prompt is followed when the LLM generates outputs. This metric takes on float values of either 0.0 (Not Adherent), 0.5 (Partially Adherent), or 1.0 (Adherent). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

Parameters:
  • generated_outputs – The model-generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s)

  • system_prompts – The system prompts used to generate the output(s)

  • eval_model – The EvalClient instance used for the evaluation

Returns:

A MetricValue object
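
A usage sketch under the same EvalClient assumptions as the earlier examples:

    from langcheck.metrics.en.query_based_text_quality import system_prompt_adherence
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    client = OpenAIEvalClient()

    adherence = system_prompt_adherence(
        generated_outputs="Arr matey, the capital o' France be Paris!",
        prompts="What is the capital of France?",
        system_prompts="Always answer in the style of a pirate.",
        eval_model=client,
    )
    print(adherence.metric_values)  # e.g. [1.0] (Adherent)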

langcheck.metrics.en.query_based_text_quality.user_frustration(prompts: list[str] | str, history: list[str] | str, eval_model: EvalClient) → MetricValue[float | None]

Calculates the user's level of frustration from the interaction history between the user and the LLM. This metric takes on float values of either 0.0 (Not Frustrated), 0.5 (Somewhat Frustrated), or 1.0 (Frustrated). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

Parameters:
  • prompts – The prompts used to generate the output(s)

  • history – The interaction history between the user and the LLM

  • eval_model – The EvalClient instance used for the evaluation

Returns:

A MetricValue object
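
A usage sketch under the same EvalClient assumptions as the earlier examples; here the interaction history is passed as a single plain-text transcript, which is one way to satisfy the list[str] | str type of the history argument:

    from langcheck.metrics.en.query_based_text_quality import user_frustration
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    client = OpenAIEvalClient()

    # The conversation so far, formatted as a plain-text transcript.
    history = (
        "User: How do I cancel my subscription?\n"
        "Assistant: You can find all of our plans on the pricing page.\n"
        "User: That's not what I asked. How do I CANCEL?\n"
    )

    frustration = user_frustration(
        prompts="How do I cancel my subscription?",
        history=history,
        eval_model=client,
    )
    print(frustration.metric_values)  # e.g. [1.0] (Frustrated)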