langcheck.augment

langcheck.augment contains all of LangCheck’s text augmentations.

These augmentations can help you automatically generate test cases to evaluate model robustness to different prompts, typos, gender changes, and more.

Currently, only English text augmentations are available. Japanese text augmentations are in development.

Tip

As a shortcut, all English text augmentations are directly accessible from langcheck.augment. For example, you can directly run langcheck.augment.keyboard_typo() instead of langcheck.augment.en.keyboard_typo().

LangCheck’s augmentation functions can take either a single string or a list of strings as input. Optionally, you can set the num_perturbations parameter for most augmentations (except the deterministic ones), which specifies how many perturbed instances to return for each string.
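To make the shape of the output concrete, here is a minimal sketch of pairing each input with its perturbed variants. It assumes (this is not stated above) that the flat result list holds num_perturbations consecutive entries per input, in input order. The helper name group_by_source and the sample strings are hypothetical, and the list fake_output only stands in for a real call such as langcheck.augment.keyboard_typo(prompts, num_perturbations=2).

```python
from typing import List, Tuple


def group_by_source(
    instances: List[str], perturbed: List[str], num_perturbations: int
) -> List[Tuple[str, List[str]]]:
    """Pair each original string with its perturbed variants.

    Assumption: `perturbed` is flat, with `num_perturbations` consecutive
    entries per input string, in the same order as `instances`.
    """
    if len(perturbed) != len(instances) * num_perturbations:
        raise ValueError("unexpected number of perturbed instances")
    groups = []
    for i, original in enumerate(instances):
        start = i * num_perturbations
        groups.append((original, perturbed[start:start + num_perturbations]))
    return groups


prompts = ["What is LangCheck?", "Summarize this text."]
# Hand-written stand-in for the output of an augmentation call.
fake_output = [
    "What is LangChek?", "Whst is LangCheck?",
    "Summarize thls text.", "Sumnarize this text.",
]
groups = group_by_source(prompts, fake_output, num_perturbations=2)
for original, variants in groups:
    print(original, "->", variants)
```

A list of tuples is used rather than a dict so that duplicate input strings do not overwrite each other.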

To see more details about each augmentation, refer to the API reference below.


langcheck.augment.change_case(instances: list[str] | str, *, to_case: str = 'uppercase', aug_char_p: float = 1.0, num_perturbations: int = 1) → list[str]

Applies a text perturbation to each string in instances (usually a list of prompts) where some characters are changed to uppercase or lowercase.

Parameters:
  • instances – A single string or a list of strings to be augmented.

  • to_case – Either ‘uppercase’ or ‘lowercase’.

  • aug_char_p – Percentage of all characters that will be augmented.

  • num_perturbations – The number of perturbed instances to generate for each string in instances.

Returns:

A list of perturbed instances.
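The behaviour described above can be illustrated with a small pure-Python sketch. This is not LangCheck's implementation: it assumes aug_char_p acts as an independent per-character probability, and the function name change_case_sketch and its seed parameter are hypothetical additions for reproducibility.

```python
import random
from typing import List, Optional, Union


def change_case_sketch(
    instances: Union[List[str], str],
    *,
    to_case: str = "uppercase",
    aug_char_p: float = 1.0,
    num_perturbations: int = 1,
    seed: Optional[int] = None,  # hypothetical; not part of the documented API
) -> List[str]:
    """Illustrative stand-in: change each character's case with probability aug_char_p."""
    if to_case not in ("uppercase", "lowercase"):
        raise ValueError("to_case must be 'uppercase' or 'lowercase'")
    convert = str.upper if to_case == "uppercase" else str.lower
    rng = random.Random(seed)
    if isinstance(instances, str):
        instances = [instances]
    perturbed = []
    for text in instances:
        for _ in range(num_perturbations):
            # rng.random() is in [0, 1), so aug_char_p=1.0 converts every character.
            perturbed.append(
                "".join(convert(ch) if rng.random() < aug_char_p else ch for ch in text)
            )
    return perturbed


print(change_case_sketch("Hello, world"))  # ['HELLO, WORLD']
print(change_case_sketch(["Hello", "World"], aug_char_p=0.5, num_perturbations=2, seed=0))
```

With the default aug_char_p of 1.0 the result is deterministic, so requesting more than one perturbation only yields identical copies in this sketch.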

langcheck.augment.gender(texts: Iterable[str] | str, *, to_gender: str = 'plural') → list[str]

Replaces pronouns with those of the specified gender.

Parameters:
  • texts – A single string or an iterable of strings to be augmented.

  • to_gender – The type of pronoun to substitute: ‘male’, ‘female’, ‘neutral’, or ‘plural’. Defaults to ‘plural’.

Returns:

List of sentences with replaced pronouns.

Note

Replacing neopronouns with other neopronouns is not supported yet because NLTK does not recognize them.

langcheck.augment.keyboard_typo(instances: list[str] | str, *, num_perturbations: int = 1, **kwargs) → list[str]

Applies a keyboard typo text perturbation to each string in instances (usually a list of prompts).

Parameters:
  • instances – A single string or a list of strings to be augmented.

  • num_perturbations – The number of perturbed instances to generate for each string in instances.

  • aug_char_p – Percentage of characters (per token) that will be augmented. Defaults to 0.1.

  • aug_char_max – Maximum number of characters which will be augmented. Defaults to None.

  • aug_word_max – Maximum number of words which will be augmented. Defaults to None.

  • include_special_char – Allow special characters to be augmented. Defaults to False.

  • include_numeric – Allow numeric characters to be augmented. Defaults to False.

Note

Any argument that can be passed to nlpaug.augmenter.char.keyboard.KeyboardAug is acceptable. Some of the more useful ones from the nlpaug documentation are listed below:

  • aug_char_p (float): Percentage of characters (per token) that will be augmented.

  • aug_char_min (int): Minimum number of characters that will be augmented.

  • aug_char_max (int): Maximum number of characters that will be augmented.

  • aug_word_p (float): Percentage of words that will be augmented.

  • aug_word_min (int): Minimum number of words that will be augmented.

  • aug_word_max (int): Maximum number of words that will be augmented.

Note that the default values for these arguments may be different from the nlpaug defaults.

Returns:

A list of perturbed instances.
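Since keyboard_typo accepts extra keyword arguments and documents its own defaults, one plausible way to picture the forwarding is an overlay of caller-supplied values on those defaults. The sketch below only models that idea; the helper resolve_keyboard_kwargs is hypothetical and is not LangCheck's code. The default values are copied from the parameter list for keyboard_typo.

```python
from typing import Any, Dict

# Defaults as documented for keyboard_typo in the parameter list.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "aug_char_p": 0.1,
    "aug_char_max": None,
    "aug_word_max": None,
    "include_special_char": False,
    "include_numeric": False,
}


def resolve_keyboard_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: overlay caller kwargs on the documented defaults."""
    resolved = dict(DOCUMENTED_DEFAULTS)  # copy so the defaults stay untouched
    resolved.update(kwargs)               # caller-supplied values take precedence
    return resolved


# Override one documented default and pass through one nlpaug-only argument.
resolved = resolve_keyboard_kwargs(aug_char_p=0.3, aug_word_min=1)
print(resolved)
```

In real use, the same keyword arguments would be passed directly, for example langcheck.augment.keyboard_typo(prompts, aug_char_p=0.3, aug_word_min=1).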

langcheck.augment.ocr_typo(instances: list[str] | str, *, num_perturbations: int = 1, **kwargs) → list[str]

Applies an OCR typo text perturbation to each string in instances (usually a list of prompts).

Parameters:
  • instances – A single string or a list of strings to be augmented.

  • num_perturbations – The number of perturbed instances to generate for each string in instances.

  • aug_char_p – Percentage of characters (per token) that will be augmented. Defaults to 0.1.

  • aug_char_max – Maximum number of characters which will be augmented. Defaults to None.

  • aug_word_max – Maximum number of words which will be augmented. Defaults to None.

Note

Any argument that can be passed to nlpaug.augmenter.char.ocr.OcrAug is acceptable. Some of the more useful ones from the nlpaug documentation are listed below:

  • aug_char_p (float): Percentage of characters (per token) that will be augmented.

  • aug_char_min (int): Minimum number of characters that will be augmented.

  • aug_char_max (int): Maximum number of characters that will be augmented.

  • aug_word_p (float): Percentage of words that will be augmented.

  • aug_word_min (int): Minimum number of words that will be augmented.

  • aug_word_max (int): Maximum number of words that will be augmented.

Note that the default values for these arguments may be different from the nlpaug defaults.

Returns:

A list of perturbed instances.

langcheck.augment.remove_punctuation(instances: list[str] | str, *, aug_char_p: float = 1.0, num_perturbations: int = 1) → list[str]

Applies a text perturbation to each string in instances (usually a list of prompts) where some punctuation is removed.

Parameters:
  • instances – A single string or a list of strings to be augmented.

  • aug_char_p – Percentage of punctuation characters that will be removed.

  • num_perturbations – The number of perturbed instances to generate for each string in instances.

Returns:

A list of perturbed instances.
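As an illustration of the perturbation described above, the following sketch drops punctuation characters with probability aug_char_p. It is an approximation under stated assumptions, not the library's implementation: "punctuation" is taken to mean Python's string.punctuation (ASCII only), and the name remove_punctuation_sketch and the seed parameter are hypothetical.

```python
import random
import string
from typing import List, Optional, Union


def remove_punctuation_sketch(
    instances: Union[List[str], str],
    *,
    aug_char_p: float = 1.0,
    num_perturbations: int = 1,
    seed: Optional[int] = None,  # hypothetical; not part of the documented API
) -> List[str]:
    """Illustrative stand-in: drop each ASCII punctuation character with probability aug_char_p."""
    rng = random.Random(seed)
    if isinstance(instances, str):
        instances = [instances]
    perturbed = []
    for text in instances:
        for _ in range(num_perturbations):
            kept = [
                ch for ch in text
                # Non-punctuation characters are always kept.
                if not (ch in string.punctuation and rng.random() < aug_char_p)
            ]
            perturbed.append("".join(kept))
    return perturbed


print(remove_punctuation_sketch("Hello, world! How's it going?"))
# ['Hello world Hows it going']
```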

langcheck.augment.rephrase(instances: list[str] | str, *, num_perturbations: int = 1, model_type: str = 'openai', openai_client: OpenAI | None = None, openai_args: dict[str, str] | None = None) → list[str | None]

Rephrases each string in instances (usually a list of prompts) without changing their meaning. We use a modified version of the prompt presented in “Rethinking Benchmark and Contamination for Language Models with Rephrased Samples” to make an LLM rephrase the given text.

We currently support two model types:

1. The ‘openai’ type, where we use OpenAI’s ‘gpt-3.5-turbo’ model by default.

2. The ‘azure_openai’ type. This is essentially the same as the ‘openai’ type, except that it uses the AzureOpenAI client. Note that you must specify the model to use in openai_args, e.g. openai_args={'model': 'YOUR_DEPLOYMENT_NAME'}.

Parameters:
  • instances – A single string or a list of strings to be augmented.

  • num_perturbations – The number of perturbed instances to generate for each string in instances.

  • model_type – The type of model to use (‘openai’ or ‘azure_openai’). Defaults to ‘openai’.

  • openai_client – An OpenAI or AzureOpenAI client. Defaults to None, in which case we will attempt to create a default client.

  • openai_args – Dict of additional arguments to pass to the client.chat.completions.create function. Defaults to None.

Returns:

A list of rephrased instances.
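The return type list[str | None] indicates that individual entries may be None. A reasonable guess, though it is an assumption here, is that None marks a request that did not produce a rephrasing. The hypothetical helper below separates usable results from missing ones. The sample list stands in for the output of a real call, which is shown only as a comment.

```python
from typing import List, Optional, Tuple


def split_rephrasings(rephrased: List[Optional[str]]) -> Tuple[List[str], int]:
    """Return the usable rephrasings and the number of missing (None) entries."""
    usable = [text for text in rephrased if text is not None]
    return usable, len(rephrased) - len(usable)


# A real call might look like this (requires langcheck and API credentials):
#   rephrased = langcheck.augment.rephrase(
#       prompts,
#       model_type="azure_openai",
#       openai_args={"model": "YOUR_DEPLOYMENT_NAME"},
#   )
# Hand-written stand-in for such a result:
rephrased = ["How do I reset my password?", None, "What's the refund policy?"]

usable, missing = split_rephrasings(rephrased)
print(f"{len(usable)} usable, {missing} missing")
```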

langcheck.augment.synonym(instances: list[str] | str, *, num_perturbations: int = 1, **kwargs) → list[str]

Applies a text perturbation to each string in instances (usually a list of prompts) where some words are replaced with synonyms.

Parameters:
  • instances – A single string or a list of strings to be augmented.

  • num_perturbations – The number of perturbed instances to generate for each string in instances.

  • aug_p – Percentage of words which will be augmented. Defaults to 0.1.

  • aug_max – Maximum number of words which will be augmented. Defaults to None.

Note

Any argument that can be passed to nlpaug.augmenter.word.SynonymAug is acceptable. Some of the more useful ones from the nlpaug documentation are listed below:

  • aug_p (float): Percentage of words which will be augmented.

  • aug_min (int): Minimum number of words that will be augmented.

  • aug_max (int): Maximum number of words that will be augmented.

Note that the default values for these arguments may be different from the nlpaug defaults.

Returns:

A list of perturbed instances.