langcheck.metrics.custom_text_quality

langcheck.metrics.custom_text_quality.custom_evaluator(generated_outputs: list[str] | str | None, prompts: list[str] | str | None, sources: list[str] | str | None, reference_outputs: list[str] | str | None, eval_model: EvalClient, metric_name: str, score_map: dict[str, float], template_path: str, language: str, *, additional_inputs: dict[str, IndividualInputType] | None = None, additional_input_name_to_prompt_var_mapping: dict[str, str] | None = None) → MetricValue[float | None]

Calculates the scores of a custom evaluator. The EvalClient will first assess the provided inputs using the prompt template, and then convert those assessments into scores using the score map.

The prompt template should be a Jinja2 file (file extension .j2) that specifies the criteria that an LLM (as configured in the EvalClient) should follow when evaluating an instance. The template is allowed to have placeholders for the following variables (NOTE: not all are required):
  • gen_output: The generated output
  • user_query: The prompt
  • src: The source text
  • ref_output: The reference output
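
For illustration only, a minimal template that uses these placeholders could look like the sketch below. The file name, evaluation criteria, and assessment labels are all hypothetical; the template is written out from Python here so the example stays self-contained.

    # Hypothetical contents of a prompt template (.j2 file). The Jinja2
    # placeholders correspond to the variables listed above.
    template = """\
    You are evaluating the fluency of a response.

    [User query]: {{ user_query }}
    [Response]: {{ gen_output }}

    Assess the fluency of the response as one of "Good", "Neutral", or "Bad",
    and output only that assessment.
    """

    # Write the template to disk so it can be passed via template_path.
    with open("fluency_template.j2", "w", encoding="utf-8") as f:
        f.write(template)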

By specifying additional inputs, the prompt template can be made more flexible. The additional inputs should be passed as a dictionary whose keys are the input names and whose values are the corresponding values; they can then be mapped to variable names in the prompt template using the additional_input_name_to_prompt_var_mapping dictionary.
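
As a hypothetical example, an extra input named topic_category could be exposed to the template as the Jinja2 variable category:

    # Hypothetical additional input, one value per evaluated instance.
    additional_inputs = {"topic_category": ["billing", "shipping"]}

    # Expose it to the template as {{ category }}.
    additional_input_name_to_prompt_var_mapping = {"topic_category": "category"}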

The prompt template should also specify the final assessments available to the LLM evaluator, e.g. “Good”, “Bad”, and “Neutral”. The score map should then map each of those assessments to a numerical score. For example, if the available assessments in the prompt template are “Good”, “Bad”, and “Neutral”, the score map should be something like:

    score_map = {'Good': 1.0, 'Neutral': 0.5, 'Bad': 0.0}

NOTE: We have found that LLMs sometimes behave unexpectedly when the assessments contain non-ASCII characters (see citadel-ai/langcheck#84 for an example), so we recommend making the final assessments ASCII characters even when the rest of the prompt template contains non-ASCII text (e.g. Japanese).
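
Putting the pieces together, a call could look like the following sketch. The template path and metric name are hypothetical, and the OpenAI-based EvalClient is just one possible choice; substitute whichever EvalClient you actually use.

    from langcheck.metrics.custom_text_quality import custom_evaluator
    from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumed choice of EvalClient

    eval_client = OpenAIEvalClient()

    result = custom_evaluator(
        generated_outputs=["The cat sat on the mat."],
        prompts=["Write a short sentence about a cat."],
        sources=None,
        reference_outputs=None,
        eval_model=eval_client,
        metric_name="fluency",  # hypothetical metric name
        score_map={"Good": 1.0, "Neutral": 0.5, "Bad": 0.0},
        template_path="fluency_template.j2",  # hypothetical template path
        language="en",
    )
    print(result)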

Parameters:
  • generated_outputs – The model-generated output(s)

  • prompts – The prompts used to generate the output(s)

  • sources – The source(s) of the generated output(s)

  • reference_outputs – The reference output(s)

  • eval_model – The EvalClient instance used for the evaluation

  • metric_name – The name of the metric

  • score_map – A dictionary mapping the evaluator’s assessments to scores

  • template_path – The path to the prompt template file. This should be a Jinja2 file (file extension .j2).

  • language – The language that the evaluator will use (‘en’, ‘ja’, or ‘de’)

  • additional_inputs – Additional inputs other than the standard ones.

  • additional_input_name_to_prompt_var_mapping – A dictionary that maps the additional input names to the variable names in the prompt template.

Returns:

A MetricValue object

langcheck.metrics.custom_text_quality.custom_pairwise_evaluator(generated_outputs_a: list[str] | str | None, generated_outputs_b: list[str] | str | None, prompts: list[str] | str | None, sources_a: list[str] | str | None, sources_b: list[str] | str | None, reference_outputs: list[str] | str | None, eval_model: EvalClient, metric_name: str, score_map: dict[str, float], template_path: str, language: str, enforce_consistency: bool = True) → MetricValue[float | None]

Calculates the scores of a custom pairwise evaluator, where “pairwise” means that the responses and/or sources of two systems are compared against each other. The EvalClient will first assess the provided inputs using the prompt template, and then convert those assessments into scores using the score map.

The prompt template should be a Jinja2 file (file extension .j2) that specifies the criteria that an LLM (as configured in the EvalClient) should follow when evaluating an instance. The template is allowed to have placeholders for the following variables (NOTE: not all are required):
  • gen_output_a: Model A’s generated output
  • gen_output_b: Model B’s generated output
  • user_query: The prompt
  • src_a: The source text for Model A
  • src_b: The source text for Model B
  • ref_output: The reference output

The prompt template should also specify the final assessments available to the LLM evaluator, e.g. “Response A”, “Response B”, and “Tie”. The score map should then map each of those assessments to a numerical score. For example, if the available assessments in the prompt template are “Response A”, “Response B”, and “Tie”, the score map should be something like:

    score_map = {'Response A': 0.0, 'Response B': 1.0, 'Tie': 0.5}

NOTE: If enforce_consistency is True, please make sure that the score map is symmetric, in the sense that swapping Model A and Model B should result in inverse scores. See the sketch below for an illustration.
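
For illustration, the score map from the example above is symmetric in this sense: swapping Model A and Model B swaps “Response A” and “Response B”, and the corresponding scores are inverses (x and 1 - x). The snippet below is only an illustrative check, not the library’s internal consistency logic:

    # Symmetric score map: swapping A and B maps each assessment to its
    # counterpart, and the counterpart's score is the inverse (1 - x).
    score_map = {"Response A": 0.0, "Response B": 1.0, "Tie": 0.5}
    swapped = {"Response A": "Response B", "Response B": "Response A", "Tie": "Tie"}

    for assessment, score in score_map.items():
        assert score_map[swapped[assessment]] == 1.0 - score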

NOTE: We have found that LLMs sometimes behave unexpectedly when the assessments contain non-ASCII characters (see citadel-ai/langcheck#84 for an example), so we recommend making the final assessments ASCII characters even when the rest of the prompt template contains non-ASCII text (e.g. Japanese).
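
A call to the pairwise evaluator could look like the sketch below. As with custom_evaluator, the template path, metric name, and choice of EvalClient are assumptions:

    from langcheck.metrics.custom_text_quality import custom_pairwise_evaluator
    from langcheck.metrics.eval_clients import OpenAIEvalClient  # assumed choice of EvalClient

    eval_client = OpenAIEvalClient()

    result = custom_pairwise_evaluator(
        generated_outputs_a=["Paris is the capital of France."],
        generated_outputs_b=["The capital of France is Paris, a city on the Seine."],
        prompts=["What is the capital of France?"],
        sources_a=None,
        sources_b=None,
        reference_outputs=None,
        eval_model=eval_client,
        metric_name="pairwise_preference",  # hypothetical metric name
        score_map={"Response A": 0.0, "Response B": 1.0, "Tie": 0.5},
        template_path="pairwise_template.j2",  # hypothetical template path
        language="en",
        enforce_consistency=True,
    )
    print(result)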

Parameters:
  • generated_outputs_a – Model A’s generated output(s)

  • generated_outputs_b – Model B’s generated output(s)

  • prompts – The prompts used to generate the output(s)

  • sources_a – The source(s) for Model A’s generated output(s)

  • sources_b – The source(s) for Model B’s generated output(s)

  • reference_outputs – The reference output(s)

  • eval_model – The EvalClient instance used for the evaluation

  • metric_name – The name of the metric

  • score_map – A dictionary mapping the evaluator’s assessments to scores

  • template_path – The path to the prompt template file. This should be a Jinja2 file (file extension .j2).

  • language – The language that the evaluator will use (‘en’, ‘ja’, or ‘de’)

  • enforce_consistency – When this is True, a score is only returned if the evaluation is consistent when Model A and Model B are swapped. This is useful for ensuring that the evaluator’s position bias does not impact the scores. Default True.

Returns:

A MetricValue object