langcheck.metrics.ja.pairwise_text_quality#
- langcheck.metrics.ja.pairwise_text_quality.pairwise_comparison(generated_outputs_a: list[str] | str, generated_outputs_b: list[str] | str, prompts: list[str] | str, sources_a: list[str] | str | None = None, sources_b: list[str] | str | None = None, reference_outputs: list[str] | str | None = None, enforce_consistency: bool = True, eval_model: EvalClient | None = None) → MetricValue[float | None] [source]#
Calculates the pairwise comparison metric. This metric takes on float values of either 0.0 (Response A is better), 0.5 (Tie), or 1.0 (Response B is better). The score may also be None if it could not be computed.
We currently only support evaluation based on an EvalClient.
- Parameters:
generated_outputs_a – Model A’s generated output(s) to evaluate
generated_outputs_b – Model B’s generated output(s) to evaluate
prompts – The prompts used to generate the output(s)
sources_a – The source text(s) for Model A’s generated output(s), default None
sources_b – The source text(s) for Model B’s generated output(s), default None
reference_outputs – The reference output(s), default None
enforce_consistency – When this is True, a score is only returned if it is the same when Model A and Model B are swapped. This helps ensure that the evaluator’s position bias does not affect the scores. Default True.
eval_model – The EvalClient instance used for the evaluation. It is marked as Optional only so that it can follow the preceding arguments that have default values (for consistency with the other metrics), but it is in fact a required argument.
- Returns:
A MetricValue object
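A minimal usage sketch is shown below. The OpenAIEvalClient and the Japanese example strings are illustrative assumptions; any EvalClient implementation can be passed as eval_model.

```python
import langcheck
from langcheck.metrics.eval_clients import OpenAIEvalClient

# Example EvalClient; assumes the OPENAI_API_KEY environment variable is set.
# Any other EvalClient implementation can be substituted here.
eval_client = OpenAIEvalClient()

prompts = ["東京タワーの高さは何メートルですか？"]
generated_outputs_a = ["東京タワーの高さは333メートルです。"]
generated_outputs_b = ["東京タワーの高さは100メートルです。"]

result = langcheck.metrics.ja.pairwise_text_quality.pairwise_comparison(
    generated_outputs_a,
    generated_outputs_b,
    prompts,
    eval_model=eval_client,
)

# Each score is 0.0 (Response A is better), 0.5 (Tie), 1.0 (Response B is
# better), or None if the score could not be computed (e.g. when the score
# changes after swapping A and B while enforce_consistency=True).
print(result.metric_values)
```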