langcheck.metrics.eval_clients

langcheck.metrics.eval_clients#

class langcheck.metrics.eval_clients.AnthropicEvalClient(anthropic_client: Anthropic | AsyncAnthropic | AnthropicVertex | AsyncAnthropicVertex | None = None, anthropic_args: dict[str, Any] | None = None, *, use_async: bool = False, vertexai: bool = False, system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: EvalClient

EvalClient defined for Anthropic API.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that gets responses to the given prompt texts. We use Anthropic’s ‘claude-3-haiku-20240307’ model by default, but you can configure it by passing the ‘model’ parameter in the anthropic_args.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer()[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.AnthropicExtractor(anthropic_client: Anthropic | AsyncAnthropic | AnthropicVertex | AsyncAnthropicVertex | None = None, anthropic_args: dict[str, Any] | None = None, *, use_async: bool = False, vertexai: bool = False)[source]#

Bases: Extractor

Score extractor for Anthropic API.

get_float_score(metric_name: str, language: str, unstructured_assessment_result: list[str | None], score_map: dict[str, float], *, tqdm_description: str | None = None) → ResponsesWithMetadata[float][source]#

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores.

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.
tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

class langcheck.metrics.eval_clients.AzureOpenAIEvalClient(text_model_name: str | None = None, embedding_model_name: str | None = None, azure_openai_client: AzureOpenAI | None = None, openai_args: dict[str, str] | None = None, *, use_async: bool = False, use_reasoning_summary: bool = False, reasoning_effort: Literal['minimal', 'low', 'medium', 'high'] | None = 'medium', reasoning_summary: Literal['auto', 'concise', 'detailed'] | None = 'auto', system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: OpenAIEvalClient

similarity_scorer() → OpenAISimilarityScorer[source]#: This method does the sanity check for the embedding_model_name and then calls the parent class’s similarity_scorer method with the additional “model” parameter. See the parent class for the detailed documentation.

class langcheck.metrics.eval_clients.AzureOpenAIExtractor(text_model_name: str | None = None, azure_openai_client: AzureOpenAI | None = None, openai_args: dict[str, str] | None = None, *, use_async: bool = False)[source]#: Bases: OpenAIExtractor

class langcheck.metrics.eval_clients.EvalClient[source]#

Bases: object

An abstract class that defines the interface for the evaluation clients. Most metrics that use external APIs such as OpenAI API call the methods defined in this class to compute the metric values.

compute_metric_values_from_template(metric_inputs: MetricInputs, template: Template, metric_name: str, language: str, score_map: dict[str, float]) → MetricValue[float | None][source]#

Compute the metric values from the given Jinja template with the metric inputs. This function assumes that the template parameters are already validated and the template is ready to be rendered.

Parameters:

metric_inputs – The metric inputs that contain the prompts, generated outputs, reference outputs… etc.
template – The Jinja template that is ready to be rendered.
enforce_pairwise_consistency – Whether to enforce pairwise consistency when computing the metric values.
metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

Returns:

The metric values computed from the template.

Return type:

MetricValue

Give scores to texts embedded in the given prompts. The function itself calls get_text_responses and get_float_score to get the scores. The function returns the scores and the unstructured explanation strings.

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
prompts – The prompts that contain the original text to be scored, the evaluation criteria… etc. Typically it is based on the Jinja prompt templates and instantiated withing each metric function.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.
intermediate_tqdm_description – The description to be shown in the tqdm bar for the unstructured assessment.
score_tqdm_description – The description to be shown in the tqdm bar for the score calculation.

Returns:

A tuple of two lists. The first list contains the scores for each prompt and the second list contains the unstructured assessment results for each prompt. Both can be None if the evaluation fails.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that gets responses to the given prompt texts. Each concrete subclass needs to define the concrete implementation of this function to enable text scoring.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

get_text_responses_with_log_likelihood(prompts: list[str], top_logprobs: int | None = None, *, tqdm_description: str | None = None) → ResponsesWithMetadata[dict[str, str | list[list[list[dict[str, str | float]]]]]][source]#

The function that gets responses with log likelihood to the given prompt texts. Each concrete subclass needs to define the concrete implementation of this function to enable text scoring.

Parameters:

prompts – The prompts you want to get the responses for.
top_logprobs – The number of logprobs to return for each token.

Returns:

A list of responses to the prompts. Each response is a tuple of the output text and the list of tuples of the output tokens and the log probabilities. The responses can be None if the evaluation fails.

load_prompt_template(language: str, metric_name: str, eval_prompt_version: str | None = None) → Template[source]#

Gets a Jinja template from the specified language, eval client, metric name, and (optionally) eval prompt version.

Parameters:

language (str) – The language of the template.
metric_name (str) – The name of the metric.
eval_prompt_version (str | None) – The version of the eval prompt. If None, the default version is used.

Returns:

The Jinja template.

Return type:

Template

repeat_requests_from_template(prompt_template_inputs: list[dict[str, str]], template: Template, num_perturbations: int = 1) → ResponsesWithMetadata[str][source]#

Repeats the request using the given Jinja template for num_perturbations times. Note that every EvalClient subclass is expected to implement get_text_responses method to get different responses for the same input.

Parameters:

instances – A single string or a list of strings to be augmented.
template – The Jinja template ready to be rendered.
num_perturbations – The number of perturbed instances to generate for each string in instances.

Returns:

A list of responses for each input. If num_pertuations is > 1, the multiple responses for the same input are included consecutively.

similarity_scorer() → BaseSimilarityScorer[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.GeminiEvalClient(model_name: str = 'gemini-1.5-flash', embed_model_name: str | None = None, generate_content_args: dict[str, Any] | None = None, genai_client: Client | None = None, *, use_async: bool = False, vertexai: bool = False, system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: EvalClient

EvalClient defined for the Gemini model.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that gets responses to the given prompt texts.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer() → GeminiSimilarityScorer[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.GeminiExtractor(model_name: str = 'gemini-1.5-flash', genai_client: Client | None = None, generate_content_args: dict[str, Any] | None = None, *, use_async: bool = False, vertexai: bool = False)[source]#

Bases: Extractor

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores. We leverage the structured output API to extract the short assessment results from the unstructured assessments, so please make sure that the model you use supports structured output (See the References for more details).

References

https://ai.google.dev/gemini-api/docs/structured-output

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.
tqdm_description (Optional) – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

class langcheck.metrics.eval_clients.LiteLLMEvalClient(model: str, embedding_model: str | None = None, *, use_async: bool = False, use_reasoning_summary: bool = False, reasoning_effort: Literal['minimal', 'low', 'medium', 'high'] | None = 'medium', reasoning_summary: Literal['auto', 'concise', 'detailed'] | None = 'auto', system_prompt: str | None = None, extractor: Extractor | None = None, api_key: str | None = None, api_base: str | None = None, api_version: str | None = None, **kwargs)[source]#

Bases: EvalClient

EvalClient defined for litellm.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that gets responses to the given prompt texts.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

The function that gets responses with log likelihood to the given prompt texts. Each concrete subclass needs to define the concrete implementation of this function to enable text scoring.

NOTE: Please make sure that the model you use supports logprobs. (https://docs.litellm.ai/docs/completion/input#translated-openai-params)

Parameters:

prompts – The prompts you want to get the responses for.
top_logprobs – The number of logprobs to return for each token.

Returns:

similarity_scorer() → LiteLLMSimilarityScorer[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.LiteLLMExtractor(model: str, *, use_async: bool = False, api_key: str | None = None, api_base: str | None = None, api_version: str | None = None, **kwargs)[source]#

Bases: Extractor

Score extractor defined for litellm.

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores. instructor is used to extract the result with robust structured outputs.

References

https://docs.litellm.ai/docs/tutorials/instructor

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.
tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

class langcheck.metrics.eval_clients.LlamaEvalClient(model_name: str = 'tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1', torch_dtype: str = 'bfloat16', tensor_parallel_size: int = 1, device: str = 'cuda', *, system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: EvalClient

EvalClient defined for the Llama-based models. It currently only supports English and Japanese. The default model is set to “tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1”. The following models are also available: - tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1 - elyza/Llama-3-ELYZA-JP-8B - rinna/llama-3-youko-8b-instruct - rinna/llama-3-youko-70b-instruct - meta-llama/Meta-Llama-3.1-8B-Instruct - meta-llama/Meta-Llama-3.1-70B-Instruct To use the 70B models, set tensor_parallel_size to 8 or more. To use the Llama 3.1 models, you need to agree to the terms of service and login with your huggingface account.

get_score(metric_name: str, language: str, prompts: str | list[str], score_map: dict[str, float]) → tuple[list[float | None], list[str | None], MetricTokenUsage | None][source]#

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
prompts – The prompts that contain the original text to be scored, the evaluation criteria… etc. Typically it is based on the Jinja prompt templates and instantiated withing each metric function.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

Returns:

A tuple of two lists. The first list contains the scores for each prompt and the second list contains the unstructured assessment results for each prompt. Both can be None if the evaluation fails.

get_text_responses(prompts: list[str], language: str, *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that generates responses to the given prompt texts.

Parameters:

prompts – The prompts you want to get the responses for.
language – The language of the prompts. (e.g. “en”)

Returns:

A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer()[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.LlamaExtractor(model_name: str = 'tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1', torch_dtype: str = 'bfloat16', tensor_parallel_size: int = 1, device: str = 'cuda', *, model: LLM | None = None, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, sampling_params: SamplingParams | None = None, system_prompt: str | None = None)[source]#

Bases: Extractor

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores.

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

class langcheck.metrics.eval_clients.MetricTokenUsage(input_token_count: int | None = None, output_token_count: int | None = None, input_token_cost: float | None = None, output_token_cost: float | None = None)[source]#

Bases: object

A dataclass that contains the token usage information for a metric.

Parameters:

input_token_count – The number of input tokens used.
output_token_count – The number of output tokens used.
input_token_cost – The cost of the input tokens.
output_token_cost – The cost of the output tokens.

input_token_cost: float | None = None#

input_token_count: int | None = None#

output_token_cost: float | None = None#

output_token_count: int | None = None#

class langcheck.metrics.eval_clients.OpenAIEvalClient(openai_client: OpenAI | AsyncOpenAI | None = None, openai_args: dict[str, str] | None = None, *, use_async: bool = False, use_reasoning_summary: bool = False, reasoning_effort: Literal['minimal', 'low', 'medium', 'high'] | None = 'medium', reasoning_summary: Literal['auto', 'concise', 'detailed'] | None = 'auto', system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: EvalClient

EvalClient defined for OpenAI API.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that gets responses to the given prompt texts. We use OpenAI’s ‘gpt-4o-mini’ model by default, but you can configure it by passing the ‘model’ parameter in the openai_args.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

The function that gets responses with log likelihood to the given prompt texts. Each concrete subclass needs to define the concrete implementation of this function to enable text scoring. This is not available for reasoning models.

NOTE: Please make sure that the model you use supports logprobs. In Azure OpenAI, the API version 2024-06-01 is the earliest GA version that supports logprobs (https://learn.microsoft.com/en-us/azure/ai-services/openai/whats-new#new-ga-api-release).

Parameters:

prompts – The prompts you want to get the responses for.
top_logprobs – The number of logprobs to return for each token.

Returns:

similarity_scorer() → OpenAISimilarityScorer[source]#: https://openai.com/blog/new-embedding-models-and-api-updates

class langcheck.metrics.eval_clients.OpenAIExtractor(openai_client: OpenAI | AsyncOpenAI | None = None, openai_args: dict[str, str] | None = None, *, use_async: bool = False)[source]#

Bases: Extractor

Score extractor defined for OpenAI API.

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores. We leverage the structured outputs API to extract the short assessment results from the unstructured assessments, so please make sure that the model you use supports structured outputs (only available in OpenAI’s latest LLMs starting with GPT-4o). Also note that structured outputs API is only available in OpenAI API version of 2024-08-01-preview or later (See the References for more details).

References

https://platform.openai.com/docs/guides/structured-outputs?api-mode=chat

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.
tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

class langcheck.metrics.eval_clients.OpenRouterEvalClient(openrouter_args: dict[str, str] | None = None, *, system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: EvalClient

EvalClient defined for the OpenRouter API.

get_score(metric_name: str, language: str, prompts: str | list[str], score_map: dict[str, float]) → tuple[list[float | None], list[str | None]][source]#

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
prompts – The prompts that contain the original text to be scored, the evaluation criteria… etc. Typically it is based on the Jinja prompt templates and instantiated withing each metric function.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

Returns:

A tuple of two lists. The first list contains the scores for each prompt and the second list contains the unstructured assessment results for each prompt. Both can be None if the evaluation fails.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that gets responses to the given prompt texts. The user’s default OpenRouter model is used by default, but you can configure it by passing the ‘model’ parameter in the openrouter_args.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

similarity_scorer()[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.OpenRouterExtractor(openrouter_args: dict[str, str] | None = None)[source]#

Bases: Extractor

Score extractor defined for the OpenRouter API.

The function that transforms the unstructured assessments (i.e. long texts that describe the evaluation results) into scores.

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
unstructured_assessment_result – The unstructured assessment results for the given assessment prompts.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.
tqdm_description – The description to be shown in the tqdm bar.

Returns:

A list of scores for the given prompts. The scores can be None if the evaluation fails.

class langcheck.metrics.eval_clients.PrometheusEvalClient(model_name: str = 'prometheus-eval/prometheus-7b-v2.0', torch_dtype: str = 'bfloat16', tensor_parallel_size: int = 1, device: str = 'cuda', *, system_prompt: str | None = None, extractor: Extractor | None = None)[source]#

Bases: EvalClient

EvalClient defined for the Prometheus 2 model. This eval client currently supports only English. Presented in “Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models” <https://arxiv.org/abs/2405.01535>. We adapted the prompts in <prometheus-eval/prometheus- eval/blob/main/libs/prometheus-eval/prometheus_eval/prompts.py>.

get_score(metric_name: str, language: str, prompts: str | list[str], score_map: dict[str, float]) → tuple[list[float | None], list[str | None]][source]#

Parameters:

metric_name – The name of the metric to be used. (e.g. “toxicity”)
language – The language of the prompts. (e.g. “en”)
prompts – The prompts that contain the original text to be scored, the evaluation criteria… etc. Typically it is based on the Jinja prompt templates and instantiated withing each metric function.
score_map – The mapping from the short assessment results (e.g. “Good”) to the scores.

Returns:

A tuple of two lists. The first list contains the scores for each prompt and the second list contains the unstructured assessment results for each prompt. Both can be None if the evaluation fails.

get_text_responses(prompts: list[str], *, tqdm_description: str | None = None) → ResponsesWithMetadata[str][source]#

The function that generates responses to the given prompt texts.

Parameters:: prompts – The prompts you want to get the responses for.
Returns:: A list of responses to the prompts. The responses can be None if the evaluation fails.

load_prompt_template(language: str, metric_name: str, eval_prompt_version: str | None = None) → Template[source]#

Gets a Jinja template from the specified language, eval client, metric name, and (optionally) eval prompt version.

Parameters:

language (str) – The language of the template.
metric_name (str) – The name of the metric.
eval_prompt_version (str | None) – The version of the eval prompt. If None, the default version is used.

Returns:

The Jinja template.

Return type:

Template

similarity_scorer()[source]#: Get the BaseSimilarityScorer object that corresponds to the EvalClient so that the similarity-related metrics can be computed. TODO: Intergrate scorer/ with eval_clients/

class langcheck.metrics.eval_clients.ResponsesWithMetadata(response_texts: list[T | None], token_usage: MetricTokenUsage | None)[source]#

Bases: list[T | None], Generic[T]

A backward-compatible list subclass that carries additional token usage information.

This class extends the built-in list to preserve existing behavior for callers that expect a plain list, ensuring backward compatibility after the function’s return type is expanded.

The motivation is to allow returning both the original list data and token usage information without breaking existing code that iterates over or mutates the list.

Example

>>> responses = fn()
>>> responses.append("new_item")   # still works like a list
>>> responses.token_usage          # token usage information is available

token_usage: MetricTokenUsage | None = None#

langcheck.metrics.eval_clients

Contents

langcheck.metrics.eval_clients#