langcheck.augment.ja
- langcheck.augment.ja.conv_hiragana(instances: list[str] | str, convert_to: str = 'kata', *, aug_char_p: float = 1.0, num_perturbations: int = 1, seed: int | None = None) → list[str]
Convert hiragana in the text to katakana or vice versa.
- Parameters:
instances – A single string or a list of strings to be augmented.
convert_to – The target script to convert to. Available values are:
  - ‘kata’ for katakana
  - ‘hkata’ for half-width katakana
  - ‘alpha’ for the Latin alphabet
aug_char_p – Proportion of all characters that will be augmented, given as a value between 0 and 1.
num_perturbations – The number of perturbed instances to generate for each string in instances.
seed – The seed for the random number generator. You can fix the seed to deterministically choose which characters to change.
- Returns:
A list of perturbed instances.
- langcheck.augment.ja.jailbreak_template(instances: list[str] | str, templates: list[str] | None = None, *, num_perturbations: int = 1, randomize_order: bool = True, seed: int | None = None) → list[str]
Applies jailbreak templates to each string in instances.
- Parameters:
instances – A single string or a list of strings to be augmented.
templates – A list of templates to apply. If None, some templates are randomly selected and used. Available templates are:
  - basic
  - chatgpt_good_vs_evil
  - john
num_perturbations – The number of perturbed instances to generate for each string in instances. Should be equal to or less than the number of templates.
randomize_order – If True, the order of the templates is randomized. When turned off, num_perturbations needs to be equal to the number of templates.
seed – The seed for the random number generator. You can fix the seed to deterministically select the same templates.
- Returns:
A list of perturbed instances.
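The interaction between templates, num_perturbations, and randomize_order can be sketched in plain Python. Everything below is illustrative rather than the library's code: the template bodies are placeholders (the real template texts are not shown in this reference), the {input_query} placeholder name and the function name apply_templates_sketch are hypothetical, and the output order (all perturbations of one instance grouped together) is an assumption.

```python
import random

# Placeholder template bodies, invented for this sketch.
TEMPLATES = {
    "basic": "[basic wrapper] {input_query}",
    "chatgpt_good_vs_evil": "[good-vs-evil wrapper] {input_query}",
    "john": "[john wrapper] {input_query}",
}


def apply_templates_sketch(instances, templates=None, *, num_perturbations=1,
                           randomize_order=True, seed=None):
    """Wrap each instance in num_perturbations templates (hypothetical helper)."""
    if isinstance(instances, str):
        instances = [instances]  # accept a single string as well as a list
    names = list(TEMPLATES) if templates is None else list(templates)
    if num_perturbations > len(names):
        raise ValueError("num_perturbations exceeds the number of templates")
    if not randomize_order and num_perturbations != len(names):
        raise ValueError("without randomization, use every template exactly once")

    rng = random.Random(seed)  # a fixed seed selects the same templates each run
    results = []
    for instance in instances:
        chosen = rng.sample(names, num_perturbations) if randomize_order else names
        for name in chosen:
            results.append(TEMPLATES[name].format(input_query=instance))
    return results


ordered = apply_templates_sketch("質問です", num_perturbations=3, randomize_order=False)
sampled_a = apply_templates_sketch(["質問A", "質問B"], num_perturbations=2, seed=1)
sampled_b = apply_templates_sketch(["質問A", "質問B"], num_perturbations=2, seed=1)
print(ordered)
print(sampled_a)
```

Sampling without replacement is why num_perturbations cannot exceed the number of templates in this sketch: each perturbation of an instance uses a different template.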
- langcheck.augment.ja.payload_splitting(instances: list[str] | str, *, num_perturbations: int = 1, seed: int | None = None) → list[str]
Applies payload splitting augmentation to each string in instances.
Ref: https://arxiv.org/pdf/2302.05733
- Parameters:
instances – A single string or a list of strings to be augmented.
num_perturbations – The number of perturbed instances to generate for each string in instances. Should be equal to or less than the number of templates.
seed – The seed for the random number generator. You can fix the seed to deterministically choose the indices to split the instances.
- Returns:
A list of perturbed instances.
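As a rough illustration of the idea, the sketch below splits a string at seeded random indices and asks for the pieces to be reassembled. It is only a sketch under stated assumptions: the function name split_payload_sketch, the num_parts parameter, the variable names s1, s2, ..., and the prompt wording are all hypothetical and are not taken from langcheck or from the referenced paper.

```python
import random


def split_payload_sketch(text, num_parts=2, seed=None):
    """Split text at random indices and build a hypothetical reassembly prompt."""
    if not 2 <= num_parts <= len(text):
        raise ValueError("num_parts must be between 2 and len(text)")
    rng = random.Random(seed)  # a fixed seed reproduces the same split indices
    # Distinct interior cut points guarantee that every part is non-empty.
    cut_points = sorted(rng.sample(range(1, len(text)), num_parts - 1))
    bounds = [0, *cut_points, len(text)]
    parts = [text[start:end] for start, end in zip(bounds, bounds[1:])]

    # Hypothetical prompt wording: one variable per part, then a concatenation.
    names = [f"s{i}" for i in range(1, num_parts + 1)]
    lines = [f"{name} = {part!r}" for name, part in zip(names, parts)]
    lines.append("z = " + " + ".join(names))
    lines.append("Respond to the request stored in z.")
    return parts, "\n".join(lines)


text = "今日の天気を教えてください"
parts, prompt = split_payload_sketch(text, num_parts=3, seed=42)
parts_again, _ = split_payload_sketch(text, num_parts=3, seed=42)
print(prompt)
```

Joining the parts in order always recovers the original string, so the perturbation changes only how the text is presented, not its content.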
- langcheck.augment.ja.synonym(instances: list[str] | str, *, num_perturbations: int = 1, seed: int | None = None, **kwargs) → list[str]
Applies a text perturbation to each string in instances (usually a list of prompts) where some words are replaced with synonyms.
- Parameters:
instances – A single string or a list of strings to be augmented.
num_perturbations – The number of perturbed instances to generate for each string in instances.
aug_p – Proportion of words with synonyms that will be augmented, passed as a keyword argument. Defaults to 0.8.
seed – The seed for the random number generator. You can fix the seed to deterministically choose which words to change.
- Returns:
A list of perturbed instances.
Note
This function requires sudachidict_core and sudachipy to be installed in your environment. Please refer to the official instructions to install them.
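The effect of aug_p and seed can be pictured with a toy version that needs no external packages. This is a sketch, not the library's implementation: the synonym table is invented, the input is assumed to be already tokenized (the real function depends on sudachipy for Japanese text), and replace_synonyms_sketch is a hypothetical name.

```python
import random

# Toy synonym table, invented for illustration only.
TOY_SYNONYMS = {
    "速い": ["素早い", "迅速な"],
    "車": ["自動車"],
}


def replace_synonyms_sketch(tokens, aug_p=0.8, seed=None):
    """Replace each token that has synonyms with probability aug_p."""
    rng = random.Random(seed)  # a fixed seed repeats the same word choices
    out = []
    for token in tokens:
        candidates = TOY_SYNONYMS.get(token)
        if candidates and rng.random() < aug_p:
            out.append(rng.choice(candidates))
        else:
            out.append(token)  # no synonym known, or the token was not selected
    return "".join(out)


tokens = ["速い", "車", "が", "好き"]
always = replace_synonyms_sketch(tokens, aug_p=1.0, seed=0)
never = replace_synonyms_sketch(tokens, aug_p=0.0)
repeat = replace_synonyms_sketch(tokens, aug_p=1.0, seed=0)
print(always, never)
```

Only tokens with an entry in the table are candidates, which matches the wording "words with synonyms" in the aug_p description above.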