symspellpy, a Python port of SymSpell

symspellpy.py

class symspellpy.symspellpy.Composition(segmented_string, corrected_string, distance_sum, log_prob_sum)
_asdict()

Return a new OrderedDict which maps field names to their values.

classmethod _make(iterable)

Make a new Composition object from a sequence or iterable.

_replace(**kwds)

Return a new Composition object replacing specified fields with new values.

corrected_string

Alias for field number 1

distance_sum

Alias for field number 2

log_prob_sum

Alias for field number 3

segmented_string

Alias for field number 0
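
As an illustrative sketch (with made-up values), a Composition behaves like any namedtuple; in practice one is obtained from SymSpell.word_segmentation(), documented below:

    from symspellpy.symspellpy import Composition

    # Illustrative values only; a real Composition comes from
    # SymSpell.word_segmentation().
    comp = Composition("in te dhird qarter", "in the third quarter", 3, -18.5)
    print(comp.corrected_string)           # access fields by name
    print(comp._asdict()["log_prob_sum"])  # or via the OrderedDict view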

class symspellpy.symspellpy.SuggestItem(term, distance, count)

Spelling suggestion returned from lookup().

Attributes:

  • _term: The suggested correctly spelled word.
  • _distance: Edit distance between the searched-for word and the suggestion.
  • _count: Frequency of suggestion in the dictionary (a measure of how common the word is).
__eq__(other)

Order by distance ascending, then by frequency count descending

__init__(term, distance, count)

Create a new instance of SuggestItem.

Args:

  • term (str): The suggested word.
  • distance (int): Edit distance from search word.
  • count (int): Frequency of suggestion in dictionary.
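
A minimal sketch of how lookup() results are ordered, using hand-built SuggestItem objects with made-up counts:

    from symspellpy.symspellpy import SuggestItem

    items = [SuggestItem("the", 2, 20_000_000),
             SuggestItem("there", 1, 2_000_000),
             SuggestItem("their", 1, 3_000_000)]
    # Distance ascending, then frequency count descending:
    items.sort(key=lambda si: (si.distance, -si.count))
    print([(si.term, si.distance, si.count) for si in items])
    # [('their', 1, 3000000), ('there', 1, 2000000), ('the', 2, 20000000)]
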
class symspellpy.symspellpy.SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1, compact_level=5)

Symmetric Delete spelling correction algorithm

Attributes:

  • _words: Dictionary of unique correct spelling words, and the frequency count for each word.
  • _below_threshold_words: Dictionary of unique words that are below the count threshold for being considered correct spellings.
  • _deletes: Dictionary that maps the hash codes of delete strings (the deletes derived from dictionary words) to lists of suggested correction words. Hash collisions are tolerated, because suggestions are ultimately verified via an edit distance function. A list of suggestions might contain a single suggestion or multiple suggestions.
  • _max_dictionary_edit_distance: Maximum edit distance allowed for dictionary lookups.
  • _prefix_length: The length of word prefixes used for spell checking.
  • _count_threshold: The minimum frequency count a term must have in the corpus to be considered a valid word for spelling correction.
  • _compact_mask: Used for generating string hashes.
  • _distance_algorithm: The edit distance algorithm to use.
  • _max_length: Length of the longest word in the dictionary.
  • _replaced_words: Dictionary of corrected/modified words.
__init__(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1, compact_level=5)

Create a new instance of SymSpell. The initial_capacity argument from the original code is omitted, since Python cannot preallocate memory.

Args:

  • max_dictionary_edit_distance (int): Maximum edit distance for doing lookups. (default 2)
  • prefix_length (int): The length of word prefixes used for spell checking. (default 7)
  • count_threshold (int): The minimum frequency count for dictionary words to be considered correct spellings. (default 1)
  • compact_level (int): Degree of favoring lower memory use over speed (0 = fastest, most memory; 16 = slowest, least memory). (default 5)

Raises:

  • ValueError: max_dictionary_edit_distance is negative
  • ValueError: prefix_length is less than 1 or smaller than max_dictionary_edit_distance
  • ValueError: count_threshold is negative
  • ValueError: compact_level is not between 0 and 16

NOTE: Currently hard-coded to use the Damerau optimal string alignment algorithm.
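
For example, constructing an instance with the defaults spelled out, and showing one of the ValueError cases:

    from symspellpy.symspellpy import SymSpell

    sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7,
                         count_threshold=1, compact_level=5)

    # Invalid parameters raise ValueError:
    try:
        SymSpell(max_dictionary_edit_distance=-1)
    except ValueError as err:
        print(err)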

_delete_in_suggestion_prefix(delete, delete_len, suggestion, suggestion_len)

Check whether all delete characters are present in the suggestion prefix in the correct order; otherwise this is just a hash collision.

_edits(word, edit_distance, delete_words)

Inexpensive and language independent: only deletes, no transposes, replaces, or inserts. Replaces and inserts are expensive and language dependent.
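
A standalone sketch of the delete-only candidate generation idea (not the private method itself): every recursive single-character deletion of a word, up to a maximum edit distance, becomes a lookup key. The helper name delete_variants is hypothetical.

    def delete_variants(word, max_distance, edit_distance=0, results=None):
        # Hypothetical helper mirroring the idea behind _edits.
        if results is None:
            results = set()
        if edit_distance >= max_distance:
            return results
        for i in range(len(word)):
            # Remove one character at each position.
            shorter = word[:i] + word[i + 1:]
            if shorter not in results:
                results.add(shorter)
                delete_variants(shorter, max_distance, edit_distance + 1,
                                results)
        return results

    print(sorted(delete_variants("word", 2)))
    # ['od', 'or', 'ord', 'rd', 'wd', 'wo', 'wod', 'wor', 'wr', 'wrd']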

_parse_words(text)

Create a non-unique wordlist from sample text. Language independent (e.g., works with Chinese characters).

create_dictionary(corpus, encoding=None)

Load multiple dictionary words from a file containing plain text. Merges with any dictionary data already loaded.

Args:

  • corpus (str): The path+filename of the file.
  • encoding (str): Text encoding of the corpus file.

Returns: True if file loaded, or False if file not found.

create_dictionary_entry(key, count)

Create/update an entry in the dictionary. For every word, deletes with an edit distance of 1..max_edit_distance are created and added to the dictionary. Every delete entry has a suggestions list, which points to the original term(s) it was created from. The dictionary may be dynamically updated (word frequency and new words) at any time by calling create_dictionary_entry.

Args:

  • key (str): The word to add to dictionary.
  • count (int): The frequency count for word.

Returns: True if the word was added as a new correctly spelled word; False if the word was added as a below-threshold word or merely updated an existing correctly spelled word.
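
A small usage sketch; note that calling it again for an existing word adds to its frequency count:

    from symspellpy.symspellpy import SymSpell, Verbosity

    sym_spell = SymSpell()
    print(sym_spell.create_dictionary_entry("hello", 100))  # True: new word
    print(sym_spell.create_dictionary_entry("hello", 50))   # False: count updated
    for si in sym_spell.lookup("helo", Verbosity.CLOSEST):
        print(si.term, si.distance, si.count)  # hello 1 150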

delete_dictionary_entry(key)

Delete an entry in the dictionary. If the deleted entry was the longest word, self._max_length is updated from the next longest word.

Args:

  • key (str): The word to delete from the dictionary.

Returns: True if the word was successfully deleted, or False if the word was not found.

load_dictionary(corpus, term_index, count_index, separator=' ', encoding=None)

Load multiple dictionary entries from a file of word/frequency count pairs. Merges with any dictionary data already loaded.

Args:

  • corpus (str): The path+filename of the file.
  • term_index (int): The column position of the word.
  • count_index (int): The column position of the frequency count.
  • separator (str): The separator characters between term and count. (default ' ')
  • encoding (str): Text encoding of the dictionary file.

Returns: True if file loaded, or False if file not found.
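
A typical call, assuming the frequency_dictionary_en_82_765.txt file that ships with symspellpy (terms in column 0, counts in column 1):

    import pkg_resources
    from symspellpy.symspellpy import SymSpell

    sym_spell = SymSpell()
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    if not sym_spell.load_dictionary(dictionary_path, term_index=0,
                                     count_index=1):
        print("Dictionary file not found")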

load_pickle(filename, compressed=True)

Load delete combinations from a pickle file. This will reduce the loading time compared to running load_dictionary() again.

Args:

  • filename (str): The path+filename of the pickle file.
  • compressed (bool): A flag to determine whether to read the pickled data as compressed data.

Returns: True if delete combinations are successfully loaded.
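
A sketch of the save/load round trip (save_pickle is documented below); the filename here is arbitrary:

    from symspellpy.symspellpy import SymSpell

    sym_spell = SymSpell()
    sym_spell.create_dictionary_entry("example", 1000)
    sym_spell.save_pickle("symspell.pickle")

    fresh = SymSpell()
    print(fresh.load_pickle("symspell.pickle"))  # True on success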

load_pickle_stream(stream)

Load delete combinations from a stream as pickle. This will reduce the loading time compared to running load_dictionary() again.

Args:

  • stream: The stream from which to load the pickled data.

Returns: True if delete combinations are successfully loaded.

lookup(phrase, verbosity, max_edit_distance=None, include_unknown=False, ignore_token=None, transfer_casing=False)

Find suggested spellings for a given phrase word.

Args:

  • phrase (str): The word being spell checked.
  • verbosity (Verbosity): The value controlling the quantity/closeness of the returned suggestions.
  • max_edit_distance (int): The maximum edit distance between phrase and suggested words.
  • include_unknown (bool): A flag to determine whether to include phrase word in suggestions, if no words within edit distance found.
  • ignore_token (regex pattern): A regex pattern describing what words/phrases to ignore and leave unchanged.
  • transfer_casing (bool): A flag to determine whether the casing (e.g., uppercase vs lowercase) should be carried over from the phrase.

Returns: A list of SuggestItem objects representing suggested correct spellings for the phrase word, sorted by edit distance, and secondarily by frequency count.

Raises:

  • ValueError: max_edit_distance is greater than _max_dictionary_edit_distance
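
For example, with a toy dictionary; include_unknown=True guarantees at least one suggestion (the input itself) when nothing is found within max_edit_distance:

    from symspellpy.symspellpy import SymSpell, Verbosity

    sym_spell = SymSpell()
    sym_spell.create_dictionary_entry("members", 20_000)
    sym_spell.create_dictionary_entry("member", 30_000)

    for si in sym_spell.lookup("menbers", Verbosity.ALL,
                               max_edit_distance=2, include_unknown=True):
        print(si.term, si.distance, si.count)
    # members 1 20000
    # member 2 30000
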
lookup_compound(phrase, max_edit_distance, ignore_non_words=False, transfer_casing=False)

lookup_compound supports compound-aware automatic spelling correction of multi-word input strings with three cases:

  1. mistakenly inserted space into a correct word led to two incorrect terms
  2. mistakenly omitted space between two correct words led to one incorrect combined term
  3. multiple independent input terms with/without spelling errors

Find suggested spellings for a multi-word input string (supports word splitting/merging).

Args:

  • phrase (str): The string being spell checked.
  • max_edit_distance (int): The maximum edit distance between input and suggested words.
  • ignore_non_words (bool): A flag to determine whether numbers and acronyms are left alone during the spell checking process.
  • transfer_casing (bool): A flag to determine whether the casing (e.g., uppercase vs lowercase) should be carried over from the phrase.

Returns: A list of SuggestItem objects representing suggested correct spellings for the input string.
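
A usage sketch with the bundled English dictionary; the correction shown is the expected output for this input:

    import pkg_resources
    from symspellpy.symspellpy import SymSpell

    sym_spell = SymSpell()
    sym_spell.load_dictionary(
        pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt"),
        term_index=0, count_index=1)

    # Handles both the split word ("th elove") and the merged one ("whereis").
    for si in sym_spell.lookup_compound("whereis th elove", max_edit_distance=2):
        print(si.term)  # where is the love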

save_pickle(filename, compressed=True)

Pickle _deletes, _words, and _max_length into a file for quicker loading later.

Args:

  • filename (str): The path+filename of the pickle file.
  • compressed (bool): A flag to determine whether to compress the pickled data.
save_pickle_stream(stream)

Pickle _deletes, _words, and _max_length into a stream for quicker loading later.

Args:

  • stream: The stream to which to save the pickled data.
word_segmentation(phrase, max_edit_distance=None, max_segmentation_word_length=None, ignore_token=None)

word_segmentation divides a string into words by inserting missing spaces at the appropriate positions. Misspelled words are corrected and do not affect segmentation. Existing spaces are allowed and considered for optimum segmentation.

word_segmentation uses a novel approach without recursion (https://medium.com/@wolfgarbe/fast-word-segmentation-for-noisy-text-2c2c41f9e8da). While each string of length n can be segmented into 2^(n-1) possible compositions (https://en.wikipedia.org/wiki/Composition_(combinatorics)), word_segmentation has a linear runtime O(n) to find the optimum composition.

Find suggested spellings for a multi-word input string (supports word splitting/merging).

Args:

  • phrase (str): The string being spell checked.
  • max_edit_distance (int): The maximum edit distance between input and corrected words (0 = no correction, segmentation only).
  • max_segmentation_word_length (int): The maximum word length that should be considered.
  • ignore_token (regex pattern): A regex pattern describing what words/phrases to ignore and leave unchanged.

Returns: A Composition containing the word-segmented string, the word-segmented and spelling-corrected string, the sum of edit distances between the input string and the corrected string, and the sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
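
A usage sketch with the bundled English dictionary; the return value is the Composition namedtuple documented at the top of this page:

    import pkg_resources
    from symspellpy.symspellpy import SymSpell

    sym_spell = SymSpell()
    sym_spell.load_dictionary(
        pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt"),
        term_index=0, count_index=1)

    result = sym_spell.word_segmentation("thequickbrownfoxjumpsoverthelazydog")
    print(result.corrected_string)  # the quick brown fox jumps over the lazy dog
    print(result.distance_sum, result.log_prob_sum)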

class symspellpy.symspellpy.Verbosity

Controls the closeness/quantity of returned spelling suggestions.

ALL = 2

All suggestions within max_edit_distance, suggestions ordered by edit distance, then by term frequency (slower, no early termination).

CLOSEST = 1

All suggestions of smallest edit distance found, suggestions ordered by term frequency.

TOP = 0

Top suggestion with the highest term frequency of the suggestions of smallest edit distance found.
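
A toy comparison of the three levels (made-up words and counts); "steems" is distance 1 from both "steem" and "steams", and distance 2 from "steam":

    from symspellpy.symspellpy import SymSpell, Verbosity

    sym_spell = SymSpell()
    for term, count in [("steam", 1), ("steams", 2), ("steem", 3)]:
        sym_spell.create_dictionary_entry(term, count)

    for verbosity in (Verbosity.TOP, Verbosity.CLOSEST, Verbosity.ALL):
        print([si.term for si in sym_spell.lookup("steems", verbosity, 2)])
    # ['steem']
    # ['steem', 'steams']
    # ['steem', 'steams', 'steam']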

editdistance.py

class symspellpy.editdistance.AbstractDistanceComparer

An interface to compute relative distance between two strings

distance(string_1, string_2, max_distance)

Return a measure of the distance between two strings.

Args:

  • string_1 (str): One of the strings to compare.
  • string_2 (str): The other string to compare.
  • max_distance (int): The maximum distance that is of interest.

Returns: -1 if the distance is greater than the max_distance, 0 if the strings are equivalent, otherwise a positive number whose magnitude increases as difference between the strings increases.

class symspellpy.editdistance.DamerauOsa

Class providing optimized methods for computing Damerau-Levenshtein Optimal String Alignment (OSA) comparisons between two strings.

__init__()

Create a new instance of DamerauOsa

_distance(string_1, string_2, len_1, len_2, start, char_1_costs, prev_char_1_costs)

Internal implementation of the core Damerau-Levenshtein optimal string alignment algorithm.

From: https://github.com/softwx/SoftWx.Match

_distance_max(string_1, string_2, len_1, len_2, start, max_distance, char_1_costs, prev_char_1_costs)

Internal implementation of the core Damerau-Levenshtein optimal string alignment algorithm that accepts a max_distance.

From: https://github.com/softwx/SoftWx.Match

distance(string_1, string_2, max_distance)

Compute and return the Damerau-Levenshtein optimal string alignment edit distance between two strings.

Args:

  • string_1 (str): One of the strings to compare.
  • string_2 (str): The other string to compare.
  • max_distance (int): The maximum distance that is of interest.

Returns: -1 if the distance is greater than the max_distance, 0 if the strings are equivalent, otherwise a positive number whose magnitude increases as difference between the strings increases.
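
For example (note that an adjacent transposition counts as a single edit under OSA):

    from symspellpy.editdistance import DamerauOsa

    comparer = DamerauOsa()
    print(comparer.distance("recieve", "receive", 2))  # 1: ie/ei transposition
    print(comparer.distance("hello", "hello", 2))      # 0: equivalent strings
    print(comparer.distance("kitten", "sitting", 2))   # -1: true distance 3 > 2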

class symspellpy.editdistance.DistanceAlgorithm

Supported edit distance algorithms

DAMERUAUOSA = 1

Damerau optimal string alignment algorithm.

LEVENSHTEIN = 0

Levenshtein algorithm.

class symspellpy.editdistance.EditDistance(algorithm)

Edit distance algorithms.

Attributes:

  • _algorithm: A DistanceAlgorithm indicating the desired edit distance algorithm.
  • _distance_comparer: An object to compute the relative distance between two strings.
__init__(algorithm)

Create a new EditDistance object.

Args:

  • algorithm (DistanceAlgorithm): The distance algorithm to use.

Raises:

  • ValueError: algorithm specifies an invalid distance algorithm.

NOTE: Levenshtein algorithm not yet implemented.

compare(string_1, string_2, max_distance)

Compare a string to the base string to determine the edit distance, using the previously selected algorithm.

Args:

  • string_1 (str): Base string.
  • string_2 (str): The string to compare.
  • max_distance (int): The maximum distance allowed.

Returns: The edit distance (or -1 if max_distance exceeded).
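
A usage sketch; the enum member is spelled DAMERUAUOSA exactly as in the library, and, per the note above, selecting LEVENSHTEIN in this version is expected to raise ValueError:

    from symspellpy.editdistance import DistanceAlgorithm, EditDistance

    ed = EditDistance(DistanceAlgorithm.DAMERUAUOSA)
    print(ed.compare("symspell", "symspel", 2))  # 1: one deletion

    try:
        EditDistance(DistanceAlgorithm.LEVENSHTEIN)  # not yet implemented
    except ValueError as err:
        print(err)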

class symspellpy.editdistance.Levenshtein

Class providing the Levenshtein algorithm for computing the edit distance metric between two strings.

__init__()

Create a new instance of Levenshtein

_distance(string_1, string_2, len_1, len_2, start, char_1_costs)

Internal implementation of the core Levenshtein algorithm.

From: https://github.com/softwx/SoftWx.Match

_distance_max(string_1, string_2, len_1, len_2, start, max_distance, char_1_costs)

Internal implementation of the core Levenshtein algorithm that accepts a max_distance.

From: https://github.com/softwx/SoftWx.Match

distance(string_1, string_2, max_distance)

Compute and return the Levenshtein edit distance between two strings.

Args:

  • string_1 (str): One of the strings to compare.
  • string_2 (str): The other string to compare.
  • max_distance (int): The maximum distance that is of interest.

Returns: -1 if the distance is greater than the max_distance, 0 if the strings are equivalent, otherwise a positive number whose magnitude increases as difference between the strings increases.

helpers.py

symspellpy.helpers.is_acronym(word)

Checks if the word is all caps (acronym) and/or contains numbers.

Args:

  • word (str): The word to check.

Returns: True if the word is all caps and/or contains numbers, e.g., ABCDE, AB12C. False if the word contains lower case letters, e.g., abcde, ABCde, abcDE, abCDe, abc12, ab12c.
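
For example, following the cases listed above:

    from symspellpy.helpers import is_acronym

    print(is_acronym("ABCDE"))  # True: all caps
    print(is_acronym("AB12C"))  # True: caps and digits
    print(is_acronym("abcDE"))  # False: contains lowercase letters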

symspellpy.helpers.null_distance_results(string1, string2, max_distance)

Determines the proper return value of an edit distance function when one or both strings are null.

Args:

  • string1 (str): Base string.
  • string2 (str): The string to compare.
  • max_distance (int): The maximum distance allowed.

Returns: -1 if the distance is greater than the max_distance, 0 if the strings are equivalent (both are None), otherwise a positive number whose magnitude is the length of the string which is not None.

symspellpy.helpers.parse_words(phrase, preserve_case=False)

Create a non-unique wordlist from sample text. Language independent (e.g., works with Chinese characters).

Args:

  • phrase (str): Sample text that could contain one or more words.
  • preserve_case (bool): A flag to determine whether to preserve casing or convert all words to lowercase.

Returns: A list of words

symspellpy.helpers.prefix_suffix_prep(string1, string2)

Calculates the starting position and lengths of two strings such that common prefix and suffix substrings are excluded. Expects len(string1) <= len(string2).

Args:

  • string1 (str): Base string.
  • string2 (str): The string to compare.

Returns: The lengths of the two strings excluding their common prefix and suffix, and the starting position of the differing section.

symspellpy.helpers.to_similarity(distance, length)

Calculate a similarity measure from an edit distance.

Args:

  • distance (int): The edit distance between two strings.
  • length (int): The length of the longer of the two strings the edit distance is from.

Returns: A similarity value from 0 to 1.0 (1 - (distance / length)), or -1 if distance is negative.
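
A quick worked example of the formula:

    from symspellpy.helpers import to_similarity

    print(to_similarity(1, 5))   # 0.8: 1 - 1/5
    print(to_similarity(0, 5))   # 1.0: identical strings
    print(to_similarity(-1, 5))  # -1: negative distance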

symspellpy.helpers.transfer_casing_for_matching_text(text_w_casing, text_wo_casing)

Transfers the casing from one text to another, assuming that they are ‘matching’ texts, i.e., they have the same length.

symspellpy.helpers.transfer_casing_for_similar_text(text_w_casing, text_wo_casing)

Transfers the casing from one text to another, for similar (not matching) texts.

  1. It uses difflib’s SequenceMatcher to identify the different types of changes needed to turn one text into the other.

  2. For each type of change:

    • inserted sections: the casing is transferred from the prior character; if there is no character before, or the character before is a space, the casing is transferred from the following character
    • deleted sections: no case transfer is required
    • equal sections: simply swap in the original text (the one with the casing), as otherwise the two are the same
    • replaced sections: the casing is transferred using the transfer_casing_for_matching_text() method if the two sections have the same length, otherwise character by character, carrying the last casing over to any additional characters
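
For example (the input pair exercises the equal and replaced cases; the output shown is the expected behavior):

    from symspellpy.helpers import transfer_casing_for_similar_text

    print(transfer_casing_for_similar_text(
        "Haw is the eeather in New York?",    # text with casing
        "how is the weather in new york?"))   # corrected text without casing
    # How is the weather in New York?
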
symspellpy.helpers.try_parse_int64(string)

Converts the string representation of a number to its 64-bit signed integer equivalent.

Args:

  • string (str): String representation of a number.

Returns: The 64-bit signed integer equivalent, or None if conversion failed or if the number is less than the min value or greater than the max value of a 64-bit signed integer.
