langcheck.metrics.en.pairwise_text_quality

langcheck.metrics.en.pairwise_text_quality.pairwise_comparison(generated_outputs_a: List[str] | str, generated_outputs_b: List[str] | str, prompts: List[str] | str, sources_a: List[str] | str | None = None, sources_b: List[str] | str | None = None, reference_outputs: List[str] | str | None = None, enforce_consistency: bool = True, model_type: str = 'openai', openai_client: OpenAI | None = None, openai_args: Dict[str, str] | None = None, *, use_async: bool = False) → MetricValue[float | None]

Calculates the pairwise comparison metric. This metric takes on float values of either 0.0 (Response A is better), 0.5 (Tie), or 1.0 (Response B is better). The score may also be None if it could not be computed.
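The score encoding above can be turned into readable labels with a small helper. The following is a minimal sketch, not part of LangCheck: the names describe_score and mean_preference_for_b are hypothetical, and averaging the scores while skipping None is just one possible aggregation convention.

```python
from typing import List, Optional


def describe_score(score: Optional[float]) -> str:
    """Map a pairwise comparison score to a human-readable label."""
    if score is None:
        return "Not computed"
    labels = {
        0.0: "Response A is better",
        0.5: "Tie",
        1.0: "Response B is better",
    }
    if score not in labels:
        raise ValueError(f"Unexpected pairwise score: {score!r}")
    return labels[score]


def mean_preference_for_b(scores: List[Optional[float]]) -> Optional[float]:
    """Average the scores that were computed, ignoring None entries.

    Values near 0.0 favor Model A and values near 1.0 favor Model B.
    Returns None when no score could be computed at all.
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)
```

For example, the scores [0.0, 1.0, 0.5, None] contain one win each and one tie, so the mean over the three computed scores is 0.5.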

We currently support two model types:

1. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default. The model is configurable, but make sure to use one that supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling). See this page for examples on setting up the OpenAI API key.

2. The ‘azure_openai’ type. This is essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify the model deployment to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}.
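The two model types above differ mainly in the keyword arguments passed to pairwise_comparison. The sketch below shows one way to assemble them; build_pairwise_kwargs and run_comparison are hypothetical helper names, and the example assumes that LangCheck is installed and that credentials for the chosen client are already configured in the environment. The import is kept inside run_comparison so that the argument-building part works without LangCheck.

```python
from typing import Any, Dict, Optional


def build_pairwise_kwargs(model_type: str = "openai",
                          deployment: Optional[str] = None) -> Dict[str, Any]:
    """Build the model-related keyword arguments for pairwise_comparison."""
    if model_type not in ("openai", "azure_openai"):
        raise ValueError(f"Unsupported model_type: {model_type!r}")
    kwargs: Dict[str, Any] = {"model_type": model_type}
    if model_type == "azure_openai":
        # The Azure type needs an explicit deployment name in openai_args.
        if deployment is None:
            raise ValueError("azure_openai requires a deployment name")
        kwargs["openai_args"] = {"model": deployment}
    return kwargs


def run_comparison(prompt: str, output_a: str, output_b: str,
                   **model_kwargs: Any):
    """Call pairwise_comparison for one prompt (needs LangCheck and an API key)."""
    # Imported lazily so the helper above can be used without LangCheck.
    from langcheck.metrics.en.pairwise_text_quality import pairwise_comparison

    return pairwise_comparison(
        generated_outputs_a=output_a,
        generated_outputs_b=output_b,
        prompts=prompt,
        **model_kwargs,
    )
```

A call such as run_comparison(prompt, answer_a, answer_b, **build_pairwise_kwargs("azure_openai", "YOUR_DEPLOYMENT_NAME")) would then use the Azure client with the named deployment.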

Ref:

Our prompt is similar to the prompt used in https://arxiv.org/abs/2306.05685

Parameters:
  • generated_outputs_a – Model A’s generated output(s) to evaluate

  • generated_outputs_b – Model B’s generated output(s) to evaluate

  • prompts – The prompts used to generate the output(s)

  • sources_a – The source text(s) for Model A’s generated output(s), default None

  • sources_b – The source text(s) for Model B’s generated output(s), default None

  • reference_outputs – The reference output(s), default None

  • enforce_consistency – When this is True, we will only return a score if the score is the same when Model A and Model B are swapped. This is useful for ensuring that the evaluator’s position bias is not impacting the scores. Default True.

  • model_type – The type of model to use (‘openai’ or ‘azure_openai’), default ‘openai’

  • openai_client – OpenAI or AzureOpenAI client, default None. If this is None, we will attempt to create a default client.

  • openai_args – Dict of additional args to pass in to the client.chat.completions.create function, default None

  • use_async – Whether to use the asynchronous API of OpenAI, default False
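The idea behind enforce_consistency can be illustrated with a toy evaluator. This is a conceptual sketch only, not LangCheck's actual implementation: consistent_score, prefers_longer, and always_first are hypothetical names, and the judges are simple stand-ins for an LLM evaluator. A judgment is kept only if it is unchanged, in terms of which response wins, after the two responses trade positions.

```python
from typing import Callable, Optional

# A judge takes (prompt, first_response, second_response) and returns
# 0.0 (first is better), 0.5 (tie), 1.0 (second is better), or None.
Judge = Callable[[str, str, str], Optional[float]]


def consistent_score(judge: Judge, prompt: str,
                     output_a: str, output_b: str) -> Optional[float]:
    """Return the judge's score only if it survives swapping A and B."""
    forward = judge(prompt, output_a, output_b)
    swapped = judge(prompt, output_b, output_a)
    if forward is None or swapped is None:
        return None
    # In the swapped call the roles are reversed, so convert the result
    # back to the original A/B orientation before comparing.
    if forward != 1.0 - swapped:
        return None
    return forward


def prefers_longer(prompt: str, first: str, second: str) -> float:
    """Toy judge without position bias: the longer response wins."""
    if len(first) > len(second):
        return 0.0
    if len(second) > len(first):
        return 1.0
    return 0.5


def always_first(prompt: str, first: str, second: str) -> float:
    """Toy judge with total position bias: the first response always wins."""
    return 0.0
```

With prefers_longer, the verdict is the same in both orders, so the score is kept. With always_first, the verdict flips when the responses are swapped, so consistent_score returns None.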

Returns:

A MetricValue object