langcheck.metrics.eval_clients

class langcheck.metrics.eval_clients.AnthropicEvalClient(anthropic_client: Anthropic | None = None, anthropic_args: dict[str, Any] | None = None, *, use_async: bool = False)

Bases: EvalClient

EvalClient defined for Anthropic API.

get_float_score(metric_name: str, language: str, unstructured_assessment_result: list[str | None], score_map: dict[str, float], *, tqdm_description: str | None = None) → list[float | None]

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores.

Parameters:
  • metric_name – The name of the metric to be used. (e.g. “toxicity”)

  • language – The language of the prompts. (e.g. “en”)

  • unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.

  • score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

  • tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

get_text_responses(prompts: Iterable[str], *, tqdm_description: str | None = None) → list[str | None]

The function that gets responses to the given prompt texts. We use Anthropic’s ‘claude-3-haiku-20240307’ model by default, but you can configure it by passing the ‘model’ parameter in the anthropic_args.

Parameters:

prompts – The prompts you want to get the responses for.

Returns:

A list of responses to the prompts. The responses can be None if the evaluation fails.
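The model override described above can be sketched as a plain dictionary merge. This is only an illustration under an assumption: that user-supplied anthropic_args take precedence over the default model. The helper name `resolve_anthropic_args` is hypothetical and not part of langcheck, and `"my-other-model"` is a placeholder rather than a real model name.

```python
from typing import Any, Dict, Optional

# Default model named in the description above.
DEFAULT_MODEL = "claude-3-haiku-20240307"


def resolve_anthropic_args(
    anthropic_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: user-supplied args override the default model."""
    resolved: Dict[str, Any] = {"model": DEFAULT_MODEL}
    resolved.update(anthropic_args or {})
    return resolved


# No arguments: the default model is used.
default_args = resolve_anthropic_args()

# Passing "model" replaces the default; other keys are carried along unchanged.
custom_args = resolve_anthropic_args({"model": "my-other-model", "max_tokens": 512})
```

With langcheck installed, the corresponding call would presumably pass the same dictionary through the anthropic_args constructor parameter shown in the class signature.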

similarity_scorer()

Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Integrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.AzureOpenAIEvalClient(text_model_name: str | None = None, embedding_model_name: str | None = None, azure_openai_client: AzureOpenAI | None = None, openai_args: dict[str, str] | None = None, *, use_async: bool = False)

Bases: OpenAIEvalClient

get_score(metric_name: str, language: str, prompts: str | Iterable[str], score_map: dict[str, float], *, intermediate_tqdm_description: str | None = None, score_tqdm_description: str | None = None) → tuple[list[float | None], list[str | None]]

This method performs a sanity check on text_model_name and then calls the parent class’s get_score method with the additional “model” parameter. See the parent class for the detailed documentation.

similarity_scorer() → OpenAISimilarityScorer

This method performs a sanity check on embedding_model_name and then calls the parent class’s similarity_scorer method with the additional “model” parameter. See the parent class for the detailed documentation.
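The check-then-delegate pattern described for both Azure methods can be sketched as follows. This is an assumption-laden illustration, not langcheck's code: the helper `build_model_kwargs`, the ValueError type, and the error message are all hypothetical. The source only says that the model name is checked and then passed on as an additional “model” parameter.

```python
from typing import Any, Dict, Optional


def build_model_kwargs(model_name: Optional[str], parameter: str) -> Dict[str, Any]:
    """Hypothetical helper: validate the model name, then build the extra kwargs."""
    if model_name is None:
        # Assumed failure mode; the real library may signal this differently.
        raise ValueError(f"{parameter} must be set before calling this method")
    # The validated name is forwarded to the parent method as "model".
    return {"model": model_name}


# A configured deployment name passes the check.
text_kwargs = build_model_kwargs("my-text-deployment", "text_model_name")

# A missing embedding model name is rejected before any request is made.
try:
    build_model_kwargs(None, "embedding_model_name")
    missing_name_rejected = False
except ValueError:
    missing_name_rejected = True
```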

class langcheck.metrics.eval_clients.EvalClient

Bases: object

An abstract class that defines the interface for the evaluation clients. Most metrics that use external APIs such as OpenAI API call the methods defined in this class to compute the metric values.

get_float_score(metric_name: str, language: str, unstructured_assessment_result: list[str | None], score_map: dict[str, float], *, tqdm_description: str | None = None) → list[float | None]

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores. A typical workflow can be:

1. Extract short assessment result strings from the unstructured assessment results.

2. Map the short assessment result strings to the scores using the score_map.

Each concrete subclass needs to define the concrete implementation of this function to enable text scoring.

Parameters:
  • metric_name – The name of the metric to be used. (e.g. “toxicity”)

  • language – The language of the prompts. (e.g. “en”)

  • unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.

  • score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

  • tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.
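The two-step workflow can be sketched without any API calls. This is a minimal illustration, not langcheck's implementation: the helper names `extract_short_assessment` and `to_float_scores` are hypothetical, and the extraction step is a naive substring search standing in for whatever a concrete subclass actually does.

```python
from typing import Dict, List, Optional


def extract_short_assessment(text: str, labels: List[str]) -> Optional[str]:
    """Hypothetical step 1: return the label that appears earliest in the text."""
    lowered = text.lower()
    hits = [(lowered.find(label.lower()), label) for label in labels]
    hits = [(pos, label) for pos, label in hits if pos >= 0]
    return min(hits)[1] if hits else None


def to_float_scores(
    unstructured_assessment_result: List[Optional[str]],
    score_map: Dict[str, float],
) -> List[Optional[float]]:
    """Step 2: map each short assessment to a score, keeping None for failures."""
    scores: List[Optional[float]] = []
    for assessment in unstructured_assessment_result:
        if assessment is None:
            # The upstream request failed, so there is nothing to score.
            scores.append(None)
            continue
        label = extract_short_assessment(assessment, list(score_map))
        scores.append(score_map[label] if label is not None else None)
    return scores


score_map = {"Good": 1.0, "Bad": 0.0}
assessments = ["The answer is polite and harmless. Good.", None, "No verdict given."]
scores = to_float_scores(assessments, score_map)
```

The output list has one entry per input assessment, so a failed or unparseable assessment shows up as None at the same index rather than shifting later results.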

get_score(metric_name: str, language: str, prompts: str | Iterable[str], score_map: dict[str, float], *, intermediate_tqdm_description: str | None = None, score_tqdm_description: str | None = None) → tuple[list[float | None], list[str | None]]

Give scores to texts embedded in the given prompts. The function itself calls get_text_responses and get_float_score to get the scores. The function returns the scores and the unstructured explanation strings.

Parameters:
  • metric_name – The name of the metric to be used. (e.g. “toxicity”)

  • language – The language of the prompts. (e.g. “en”)

  • prompts – The prompts that contain the original text to be scored, the evaluation criteria, etc. Typically they are based on the Jinja prompt templates and instantiated within each metric function.

  • score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

  • intermediate_tqdm_description – The description to be shown in the tqdm bar for the unstructured assessment.

  • score_tqdm_description – The description to be shown in the tqdm bar for the score calculation.

Returns:

A tuple of two lists. The first list contains the scores for each prompt and the second list contains the unstructured assessment results for each prompt. Entries in either list can be None if the evaluation fails.
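How get_score composes the other two methods can be sketched with a toy client that makes no API calls. Everything here is hypothetical: `ToyEvalClient` is not a langcheck class, its responses are canned, the simplified method signatures omit metric_name, language and the tqdm options, and wrapping a single string prompt into a list is an assumption based on the `str | Iterable[str]` type of prompts.

```python
from typing import Dict, Iterable, List, Optional, Tuple, Union


class ToyEvalClient:
    """Hypothetical stand-in for an EvalClient subclass."""

    def get_text_responses(self, prompts: Iterable[str]) -> List[Optional[str]]:
        # Pretend the model fails on empty prompts and answers otherwise.
        return [f"Assessment: Good ({p})" if p else None for p in prompts]

    def get_float_score(
        self, assessments: List[Optional[str]], score_map: Dict[str, float]
    ) -> List[Optional[float]]:
        scores: List[Optional[float]] = []
        for text in assessments:
            label = None
            if text is not None:
                # Naive extraction: first score_map key contained in the text.
                label = next((key for key in score_map if key in text), None)
            scores.append(score_map[label] if label is not None else None)
        return scores

    def get_score(
        self, prompts: Union[str, Iterable[str]], score_map: Dict[str, float]
    ) -> Tuple[List[Optional[float]], List[Optional[str]]]:
        if isinstance(prompts, str):
            prompts = [prompts]
        # Step 1: unstructured assessments. Step 2: scores derived from them.
        explanations = self.get_text_responses(prompts)
        scores = self.get_float_score(explanations, score_map)
        return scores, explanations


scores, explanations = ToyEvalClient().get_score(
    ["Is this text toxic?", ""], {"Good": 1.0, "Bad": 0.0}
)
```

Both returned lists are aligned with the input prompts, so the second prompt's failure appears as None in the same position of each list.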

get_text_responses(prompts: Iterable[str], *, tqdm_description: str | None = None) → list[str | None]

The function that gets responses to the given prompt texts. Each concrete subclass needs to define the concrete implementation of this function to enable text scoring.

Parameters:

prompts – The prompts you want to get the responses for.

Returns:

A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer() → BaseSimilarityScorer

Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Integrate scorer/ with eval_clients/
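As a rough picture of what a similarity-related metric computes, the sketch below scores pairs of embedding vectors with cosine similarity. This page does not state how BaseSimilarityScorer works, so treating it as embedding-based cosine similarity is an assumption; the function names and the toy vectors are hypothetical, and a real scorer would obtain its embeddings from the client's embedding model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_scores(
    generated: List[Sequence[float]], reference: List[Sequence[float]]
) -> List[float]:
    """Score each generated embedding against the reference at the same index."""
    return [cosine_similarity(g, r) for g, r in zip(generated, reference)]


# Toy 2-dimensional "embeddings": one identical pair and one orthogonal pair.
similarities = pairwise_scores([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```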

class langcheck.metrics.eval_clients.GeminiEvalClient(model: genai.GenerativeModel | None = None, model_args: dict[str, Any] | None = None, generate_content_args: dict[str, Any] | None = None, embed_model_name: str | None = None)

Bases: EvalClient

EvalClient defined for the Gemini model.

get_float_score(metric_name: str, language: str, unstructured_assessment_result: list[str | None], score_map: dict[str, float], *, tqdm_description: str | None = None) → list[float | None]

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores. We leverage the function calling API to extract the short assessment results from the unstructured assessments, so please make sure that the model you use supports function calling (https://ai.google.dev/gemini-api/docs/function-calling#supported-models).

Ref:

https://ai.google.dev/gemini-api/docs/function-calling

Parameters:
  • metric_name – The name of the metric to be used. (e.g. “toxicity”)

  • language – The language of the prompts. (e.g. “en”)

  • unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.

  • score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

  • tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

get_text_responses(prompts: Iterable[str], *, tqdm_description: str | None = None) → list[str | None]

The function that gets responses to the given prompt texts.

Parameters:

prompts – The prompts you want to get the responses for.

Returns:

A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer() → GeminiSimilarityScorer

Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Integrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.OpenAIEvalClient(openai_client: OpenAI | None = None, openai_args: dict[str, str] | None = None, *, use_async: bool = False)

Bases: EvalClient

EvalClient defined for OpenAI API.

get_float_score(metric_name: str, language: str, unstructured_assessment_result: list[str | None], score_map: dict[str, float], *, tqdm_description: str | None = None) → list[float | None]

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores. We leverage the function calling API to extract the short assessment results from the unstructured assessments, so please make sure that the model you use supports function calling (https://platform.openai.com/docs/guides/gpt/function-calling).

Ref:

https://platform.openai.com/docs/guides/gpt/function-calling

Parameters:
  • metric_name – The name of the metric to be used. (e.g. “toxicity”)

  • language – The language of the prompts. (e.g. “en”)

  • unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.

  • score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

  • tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.
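One way function calling can constrain the short assessment is to declare a function whose only argument is an enum of the score_map keys, then map the returned argument back to a score. The sketch below builds such a declaration in the shape of the `tools` parameter of OpenAI's Chat Completions API and parses a returned arguments string. The function name `save_assessment`, the argument name `assessment`, and both helpers are hypothetical; the declaration langcheck actually sends is not shown on this page.

```python
import json
from typing import Any, Dict, Optional


def build_assessment_tool(metric_name: str, score_map: Dict[str, float]) -> Dict[str, Any]:
    """Hypothetical tool declaration limiting the answer to the score_map keys."""
    return {
        "type": "function",
        "function": {
            "name": "save_assessment",
            "description": f"Save the short {metric_name} assessment.",
            "parameters": {
                "type": "object",
                "properties": {
                    "assessment": {
                        "type": "string",
                        "enum": list(score_map),
                        "description": "The short assessment result.",
                    }
                },
                "required": ["assessment"],
            },
        },
    }


def score_from_tool_arguments(
    arguments_json: str, score_map: Dict[str, float]
) -> Optional[float]:
    """Map the JSON arguments of a function call to a score, or None on failure."""
    try:
        assessment = json.loads(arguments_json).get("assessment")
        return score_map.get(assessment)
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Malformed JSON, a non-object payload, or an unusable value.
        return None


score_map = {"Good": 1.0, "Bad": 0.0}
tool = build_assessment_tool("toxicity", score_map)
parsed_score = score_from_tool_arguments('{"assessment": "Bad"}', score_map)
failed_score = score_from_tool_arguments("not json", score_map)
```

Restricting the argument to an enum means any label the model returns is already a key of score_map, which keeps the final lookup trivial.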

get_text_responses(prompts: Iterable[str], *, tqdm_description: str | None = None) → list[str | None]

The function that gets responses to the given prompt texts. We use OpenAI’s ‘gpt-3.5-turbo’ model by default, but you can configure it by passing the ‘model’ parameter in the openai_args.

Parameters:

prompts – The prompts you want to get the responses for.

Returns:

A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer() → OpenAISimilarityScorer

https://openai.com/blog/new-embedding-models-and-api-updates