helpers

Helpers for editdistance

symspellpy.helpers.null_distance_results(string1, string2, max_distance)

Determines the proper return value of an edit distance function when one or both strings are null.

Parameters
  • string1 – Base string.

  • string2 – The string to compare.

  • max_distance (int) – The maximum distance allowed.

Return type

int

Returns

-1 if the distance is greater than the max_distance, 0 if the strings are equivalent (both are None), otherwise a positive number whose magnitude is the length of the string which is not None.
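
The return rules above can be illustrated with a small standalone sketch. The name null_distance_results_sketch is hypothetical, and the body is a re-implementation of the documented behavior rather than the library source. It assumes that at least one of the two inputs is None.

```python
from typing import Optional


def null_distance_results_sketch(
    string1: Optional[str], string2: Optional[str], max_distance: int
) -> int:
    """Distance when at least one input is None (hypothetical sketch)."""
    if string1 is None and string2 is None:
        return 0  # both missing: treated as equivalent
    # Assumption: exactly one side is None here, so the distance is the
    # length of the string that is present.
    present = string2 if string1 is None else string1
    return len(present) if len(present) <= max_distance else -1


print(null_distance_results_sketch(None, None, 2))    # 0
print(null_distance_results_sketch(None, "abc", 3))   # 3
print(null_distance_results_sketch("abcd", None, 3))  # -1
```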

symspellpy.helpers.prefix_suffix_prep(string1, string2)

Calculates starting position and lengths of two strings such that common prefix and suffix substrings are excluded. Expects len(string1) <= len(string2).

Parameters
  • string1 – Base string.

  • string2 – The string to compare.

Return type

Tuple[int, int, int]

Returns

A tuple of the lengths of the two strings' parts that remain after excluding the common prefix and suffix, and the starting position.
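
A minimal sketch of the trimming step may make the return value easier to read. The name prefix_suffix_prep_sketch is hypothetical, the order of trimming (suffix first, then prefix) is an assumption, and the code relies on the stated precondition len(string1) <= len(string2).

```python
from typing import Tuple


def prefix_suffix_prep_sketch(string1: str, string2: str) -> Tuple[int, int, int]:
    """Return (len1, len2, start) after trimming shared ends (hypothetical)."""
    len1, len2 = len(string1), len(string2)
    # Drop the common suffix by shrinking both lengths together.
    while len1 != 0 and string1[len1 - 1] == string2[len2 - 1]:
        len1 -= 1
        len2 -= 1
    # Skip the common prefix by advancing the start position.
    start = 0
    while start != len1 and string1[start] == string2[start]:
        start += 1
    # Report lengths relative to the new start position.
    return len1 - start, len2 - start, start


print(prefix_suffix_prep_sketch("abcd", "abxxcd"))  # (0, 2, 2)
print(prefix_suffix_prep_sketch("cat", "cut"))      # (1, 1, 1)
```

In the first call only the inserted "xx" is left to compare; in the second, only the middle characters differ.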

Helpers for symspellpy

class symspellpy.helpers.DictIO(dictionary, separator=' ')

An iterator wrapper for python dictionary to format the output as required by load_dictionary_stream() and load_dictionary_bigram_stream().

Parameters
  • dictionary (Dict[str, int]) – Dictionary with words as keys and frequency count as values.

  • separator (str) – Separator characters between term(s) and count.

iteritems

An iterator object of dictionary.items().

separator

Separator characters between term(s) and count.
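
The wrapper idea can be sketched as an iterator that turns each dictionary item into one text line. DictLinesSketch is a hypothetical name, and the exact line layout (term, separator, count) is an assumption about what the stream loaders expect, not something confirmed here.

```python
from typing import Dict


class DictLinesSketch:
    """Iterate over a dict as 'term<separator>count' lines (hypothetical)."""

    def __init__(self, dictionary: Dict[str, int], separator: str = " ") -> None:
        self.iteritems = iter(dictionary.items())
        self.separator = separator

    def __iter__(self) -> "DictLinesSketch":
        return self

    def __next__(self) -> str:
        # StopIteration from the inner iterator ends the outer loop too.
        term, count = next(self.iteritems)
        return f"{term}{self.separator}{count}"


lines = list(DictLinesSketch({"the": 100, "in the": 25}, separator="$"))
print(lines)  # ['the$100', 'in the$25']
```

A non-space separator, as in the usage line, is useful when the terms themselves contain spaces, for example bigrams.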

symspellpy.helpers.case_transfer_matching(cased_text, uncased_text)

Transfers the casing from one text to another, assuming that they are ‘matching’ texts, i.e., they have the same length.

Parameters
  • cased_text (str) – Text with varied casing.

  • uncased_text (str) – Text that is in lowercase only.

Return type

str

Returns

Text with the content of uncased_text and the casing of cased_text.

Raises

ValueError – If the input texts have different lengths.
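
As a rough illustration, position-by-position case transfer can be written as below. The name case_transfer_matching_sketch is hypothetical and the body is a sketch of the documented behavior, not the library source.

```python
def case_transfer_matching_sketch(cased_text: str, uncased_text: str) -> str:
    """Copy the casing of cased_text onto uncased_text (hypothetical sketch)."""
    if len(cased_text) != len(uncased_text):
        raise ValueError("'cased_text' and 'uncased_text' must have the same length")
    # Walk both texts in lockstep and mirror the case of each position.
    return "".join(
        u.upper() if c.isupper() else u.lower()
        for c, u in zip(cased_text, uncased_text)
    )


print(case_transfer_matching_sketch("HeLLo", "world"))  # WoRLd
```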

symspellpy.helpers.case_transfer_similar(cased_text, uncased_text)

Transfers the casing from one text to another - for similar (not matching) text.

Use difflib.SequenceMatcher to identify the different type of changes needed to turn cased_text into uncased_text.

  • For inserted sections: transfer the casing from the prior character. If there is no prior character or the prior character is a space, transfer the casing from the following character.

  • For deleted sections: no case transfer is required.

  • For equal sections: swap in the original, cased text, as the two are otherwise the same.

  • For replaced sections: transfer the casing using case_transfer_matching() if the two have the same length, otherwise transfer character-by-character and carry the last casing over to any additional characters.

Parameters
  • cased_text (str) – Text with varied casing.

  • uncased_text (str) – Text in lowercase.

Return type

str

Returns

Text with the content of uncased_text but the casing of cased_text.

Raises

ValueError – If cased_text is empty.
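
The four rules listed for inserted, deleted, equal and replaced sections can be sketched with difflib.SequenceMatcher.get_opcodes(). The names case_transfer_similar_sketch and _apply_case are hypothetical, and two details are assumptions: an empty uncased_text is returned unchanged, and the reference index for an insertion is clamped to the last character of cased_text.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def _apply_case(reference: str, chars: str) -> str:
    """Upper-case chars if the reference character is upper case."""
    return chars.upper() if reference.isupper() else chars.lower()


def case_transfer_similar_sketch(cased_text: str, uncased_text: str) -> str:
    if not uncased_text:
        return uncased_text  # assumption: nothing to transfer onto
    if not cased_text:
        raise ValueError("'cased_text' cannot be empty")

    matcher = SequenceMatcher(a=cased_text.lower(), b=uncased_text)
    pieces = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # nothing from uncased_text to emit
        if tag == "equal":
            pieces.append(cased_text[a1:a2])  # reuse the original casing
        elif tag == "insert":
            # Prefer the prior character; use the following one at the
            # start of the text or right after a space.
            ref = a1 if a1 == 0 or cased_text[a1 - 1] == " " else a1 - 1
            ref = min(ref, len(cased_text) - 1)  # assumption: clamp at the end
            pieces.append(_apply_case(cased_text[ref], uncased_text[b1:b2]))
        else:  # "replace"
            last = cased_text[a1]
            for cased, uncased in zip_longest(cased_text[a1:a2], uncased_text[b1:b2]):
                if uncased is None:
                    break  # cased side is longer; nothing left to emit
                if cased is not None:
                    last = cased  # otherwise carry the last casing forward
                pieces.append(_apply_case(last, uncased))
    return "".join(pieces)


print(case_transfer_similar_sketch("HELLO", "helloo"))  # HELLOO
print(case_transfer_similar_sketch("CAT", "cut"))       # CUT
```

For replaced sections of equal length, the character-by-character loop gives the same result as a position-wise transfer, so the sketch does not branch on length.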

symspellpy.helpers.increment_count(count, count_previous)

Increments count up to sys.maxsize.

Return type

int
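
A capped increment of this kind can be sketched as a saturating addition. The name increment_count_sketch is hypothetical, and the roles of the arguments (count is the amount to add, count_previous is the running total) are assumptions, since the parameters are not described here.

```python
import sys


def increment_count_sketch(count: int, count_previous: int) -> int:
    """Add count to count_previous without exceeding sys.maxsize (sketch)."""
    # Python integers do not overflow, so the cap is applied explicitly.
    return min(count_previous + count, sys.maxsize)


print(increment_count_sketch(1, 2))                                 # 3
print(increment_count_sketch(10, sys.maxsize - 5) == sys.maxsize)   # True
```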

symspellpy.helpers.is_acronym(word, contain_digits=False)

Checks if the word is all caps (acronym) and/or contains numbers.

Parameters
  • word (str) – The word to check.

  • contain_digits (bool) – A flag that determines whether any term with digits is considered an acronym.

Return type

bool

Returns

True if the word is all caps and/or contains numbers, e.g., ABCDE, AB12C, abc12, ab12c. False if the word contains lower case letters, e.g., abcde, ABCde, abcDE, abCDe.
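
One way to read the examples above is sketched below. The name is_acronym_sketch is hypothetical, and the matching rule (two or more upper-case ASCII letters or digits, with any digit-bearing term accepted when contain_digits is set) is an assumption chosen to reproduce the listed examples, not the library's confirmed logic.

```python
import re


def is_acronym_sketch(word: str, contain_digits: bool = False) -> bool:
    """Guess whether word is an acronym (hypothetical sketch)."""
    if contain_digits and any(ch.isdigit() for ch in word):
        return True  # any term with a digit counts when the flag is set
    # Assumption: at least two characters, upper-case letters or digits only.
    return re.fullmatch(r"[A-Z0-9]{2,}", word) is not None


print(is_acronym_sketch("ABCDE"))                       # True
print(is_acronym_sketch("abc12", contain_digits=True))  # True
print(is_acronym_sketch("ABCde"))                       # False
```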

symspellpy.helpers.parse_words(phrase, preserve_case=False, split_by_space=False)

Creates a non-unique wordlist from sample text. Language-independent (e.g., works with Chinese characters).

Parameters
  • phrase (str) – Sample text that could contain one or more words.

  • preserve_case (bool) – A flag that determines whether to preserve the case or convert everything to lowercase.

  • split_by_space (bool) – Splits the phrase into words simply based on space.

Return type

List[str]

Returns

A list of words
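
The effect of the two flags can be sketched as follows. The name parse_words_sketch is hypothetical, and the regular expression is an assumption: it treats a word as a run of letters or digits in any script, optionally containing apostrophes, with punctuation and underscores acting as separators.

```python
import re
from typing import List

# Assumed word pattern: letters/digits, optional apostrophes, more letters/digits.
_WORD_RE = re.compile(r"[^\W_]+['’]*[^\W_]*")


def parse_words_sketch(
    phrase: str, preserve_case: bool = False, split_by_space: bool = False
) -> List[str]:
    """Split phrase into a list of words, duplicates kept (hypothetical sketch)."""
    if not preserve_case:
        phrase = phrase.lower()
    if split_by_space:
        return phrase.split()  # punctuation stays attached to the words
    return _WORD_RE.findall(phrase)


print(parse_words_sketch("Hello, world! Hello"))
# ['hello', 'world', 'hello']
print(parse_words_sketch("Hello, world!", split_by_space=True))
# ['hello,', 'world!']
```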

symspellpy.helpers.try_parse_int64(string)

Converts the string representation of a number to its 64-bit signed integer equivalent.

Parameters

string (str) – String representation of a number.

Return type

Optional[int]

Returns

The 64-bit signed integer equivalent, or None if conversion failed or if the number is less than the min value or greater than the max value of a 64-bit signed integer.
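
A range-checked conversion along these lines is sketched below. The name try_parse_int64_sketch and the constants INT64_MIN and INT64_MAX are hypothetical; the bounds used are the standard limits of a 64-bit signed integer, as the description states.

```python
from typing import Optional

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def try_parse_int64_sketch(string: str) -> Optional[int]:
    """Parse string as a 64-bit signed integer, or return None (sketch)."""
    try:
        value = int(string)
    except ValueError:
        return None  # not a valid integer literal
    # Python integers are unbounded, so the 64-bit range is checked by hand.
    return value if INT64_MIN <= value <= INT64_MAX else None


print(try_parse_int64_sketch("42"))                   # 42
print(try_parse_int64_sketch("abc"))                  # None
print(try_parse_int64_sketch("9223372036854775808"))  # None
```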

Misc

symspellpy.helpers.to_similarity(distance, length)

Calculates a similarity measure from an edit distance.

Parameters
  • distance (int) – The edit distance between two strings.

  • length (int) – The length of the longer of the two strings the edit distance is from.

Return type

float

Returns

A similarity value from 0 to 1.0, computed as 1 - (distance / length), or -1 if distance is negative.
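
A short worked sketch, assuming the similarity is 1 minus the ratio of distance to length, which is the form that stays within the stated 0 to 1.0 range when length is the longer string. The name to_similarity_sketch is hypothetical, and a positive length is assumed.

```python
def to_similarity_sketch(distance: int, length: int) -> float:
    """Map an edit distance to a similarity in [0, 1], or -1 (sketch)."""
    if distance < 0:
        return -1.0  # a negative distance signals "no valid distance"
    return 1.0 - distance / length


print(to_similarity_sketch(1, 4))   # 0.75
print(to_similarity_sketch(0, 5))   # 1.0
print(to_similarity_sketch(-1, 5))  # -1.0
```

For example, one edit between two strings whose longer length is 4 gives 1 - 1/4 = 0.75.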