langcheck.metrics.en.pairwise_text_quality

langcheck.metrics.en.pairwise_text_quality.pairwise_comparison(generated_outputs_a: list[str] | str, generated_outputs_b: list[str] | str, prompts: list[str] | str, sources_a: list[str] | str | None = None, sources_b: list[str] | str | None = None, reference_outputs: list[str] | str | None = None, enforce_consistency: bool = True, calculated_confidence: bool = False, preference_data_path: str = 'en/confidence_estimating/preference_data_examples.jsonl', k: int = 5, n: int = 5, seed: int | None = None, eval_model: EvalClient | None = None) → MetricValue[float | None]

Calculates the pairwise comparison metric. This metric takes on float values of either 0.0 (Response A is better), 0.5 (Tie), or 1.0 (Response B is better). The score may also be None if it could not be computed.

We currently only support evaluation based on an EvalClient.

Parameters:
  • generated_outputs_a – Model A’s generated output(s) to evaluate

  • generated_outputs_b – Model B’s generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s)

  • sources_a – The source text(s) for Model A’s generated output(s), default None

  • sources_b – The source text(s) for Model B’s generated output(s), default None

  • reference_outputs – The reference output(s), default None

  • enforce_consistency – When this is True, we will only return a score if the score is the same when Model A and Model B are swapped. This is useful for ensuring that the evaluator’s position bias is not impacting the scores. Default True.

  • calculated_confidence – When this is True, we will calculate a confidence score for the pairwise comparison metric. Default False.

  • preference_data_path – The relative path to preference data labeled by human annotators. Users should prepare a pool of preference annotations (e.g., 1000 examples) in advance to use this metric.

  • k – The number of preference annotation examples

  • n – The number of simulated annotators

  • seed – The random seed for the simulated annotators

  • eval_model – The EvalClient instance used for the evaluation. This is marked as Optional so that it can follow the above arguments that have default values (for consistency with the other metrics), but this is in fact a required argument.

Returns:

A MetricValue object
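
A minimal usage sketch follows. It assumes the OpenAI-based EvalClient bundled with LangCheck and an OPENAI_API_KEY in the environment; adapt the client setup to whichever EvalClient you actually use.

    from langcheck.metrics.en.pairwise_text_quality import pairwise_comparison
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    # Assumes OPENAI_API_KEY is set in the environment.
    eval_client = OpenAIEvalClient()

    prompts = ["What is the capital of France?"]
    generated_outputs_a = ["The capital of France is Paris."]
    generated_outputs_b = ["I think it might be Lyon."]

    # 0.0 = Response A is better, 0.5 = Tie, 1.0 = Response B is better.
    # Entries can be None, e.g. when enforce_consistency=True and the judge
    # flips its preference after the responses are swapped.
    result = pairwise_comparison(
        generated_outputs_a,
        generated_outputs_b,
        prompts,
        eval_model=eval_client,
    )
    print(result.metric_values)  # e.g. [0.0]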

langcheck.metrics.en.pairwise_text_quality.simulated_annotators(prompt_params: list[dict[str, str | None]], eval_model: EvalClient, preference_data_path: str = 'en/confidence_estimating/preference_data_examples.jsonl', k: int = 5, n: int = 5, seed: int | None = None) → list[float | None]

Compute a confidence score for the pairwise comparison metric based on the Simulated Annotators method proposed in the paper “Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement” (https://arxiv.org/abs/2407.18370).

Parameters:
  • prompt_params – The parameters used to populate the prompt template.

  • eval_model – The EvalClient instance used for the evaluation.

  • preference_data_path – The relative path to preference data labeled by human annotators. Users should prepare a pool of preference annotations (e.g., 1000 examples) in advance to use this metric.

  • k – The number of preference annotation examples

  • n – The number of simulated annotators

  • seed – The random seed for selecting the few-shot examples

Returns:

A confidence score for the pairwise comparison metric
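
In normal use this function is invoked internally by pairwise_comparison when calculated_confidence=True, but it can also be called directly. The sketch below is illustrative only: the dictionary keys in prompt_params are assumptions and must match the placeholders expected by the prompt template that accompanies your preference data.

    from langcheck.metrics.en.pairwise_text_quality import simulated_annotators
    from langcheck.metrics.eval_clients import OpenAIEvalClient

    eval_client = OpenAIEvalClient()

    # NOTE: these key names are illustrative assumptions; use the placeholder
    # names expected by your prompt template and preference data.
    prompt_params = [
        {
            "user_query": "What is the capital of France?",
            "gen_output_a": "The capital of France is Paris.",
            "gen_output_b": "I think it might be Lyon.",
        }
    ]

    # For each input, n simulated annotators are prompted, each conditioned on
    # k few-shot preference examples sampled from the human-labeled pool.
    confidence_scores = simulated_annotators(
        prompt_params,
        eval_model=eval_client,
        k=5,
        n=5,
        seed=42,
    )
    print(confidence_scores)  # one float (or None) per input, e.g. [0.8]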