symspellpy

Enum class

class symspellpy.verbosity.Verbosity(value)[source]

Controls the closeness/quantity of returned spelling suggestions.

TOP

Top suggestion with the highest term frequency of the suggestions of smallest edit distance found.

CLOSEST

All suggestions of smallest edit distance found, suggestions ordered by term frequency.

ALL

All suggestions within max_edit_distance, ordered by edit distance, then by term frequency (slower, no early termination).
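The three levels can be pictured as filters over one ranked candidate list. The following is a minimal standalone sketch, not symspellpy's implementation: Level and select are hypothetical names, and candidates are plain (term, distance, count) tuples.

```python
from enum import Enum


class Level(Enum):  # hypothetical stand-in for Verbosity
    TOP = 0
    CLOSEST = 1
    ALL = 2


def select(candidates, level):
    """Filter (term, distance, count) tuples according to the level."""
    # Rank by edit distance ascending, then by frequency descending.
    ranked = sorted(candidates, key=lambda c: (c[1], -c[2]))
    if not ranked or level is Level.ALL:
        return ranked
    smallest = ranked[0][1]
    closest = [c for c in ranked if c[1] == smallest]
    # TOP keeps only the most frequent of the closest candidates.
    return closest[:1] if level is Level.TOP else closest


candidates = [("them", 1, 50), ("the", 1, 900), ("three", 2, 70)]
top = select(candidates, Level.TOP)
closest = select(candidates, Level.CLOSEST)
everything = select(candidates, Level.ALL)
```

A real lookup can stop searching early for TOP and CLOSEST once a smaller distance is found, which is why ALL is described as slower.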

Data class

class symspellpy.suggest_item.SuggestItem(term, distance, count)[source]

Spelling suggestion returned from lookup().

Parameters
  • term (str) – The suggested word.

  • distance (int) – Edit distance from search word.

  • count (int) – Frequency of suggestion in dictionary or Naive Bayes probability of the individual suggestion parts.

__eq__(other)[source]
Return type

bool

Returns

True if both distance and frequency count are the same.

__lt__(other)[source]
Return type

bool

Returns

Order by distance ascending, then by frequency count descending.

__str__()[source]
Return type

str

Returns

Displays attributes as “term, distance, count”.
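The comparison rules described for __eq__, __lt__, and __str__ can be sketched with a small standalone class. Item is a hypothetical stand-in written for illustration; it mirrors the documented behavior but is not the library's code.

```python
from functools import total_ordering


@total_ordering
class Item:  # hypothetical stand-in for SuggestItem
    def __init__(self, term, distance, count):
        self.term, self.distance, self.count = term, distance, count

    def __eq__(self, other):
        # Equality ignores the term: only distance and count are compared.
        return (self.distance, self.count) == (other.distance, other.count)

    def __lt__(self, other):
        # Smaller distance sorts first; for equal distance, larger count first.
        if self.distance == other.distance:
            return self.count > other.count
        return self.distance < other.distance

    def __str__(self):
        return f"{self.term}, {self.distance}, {self.count}"


items = [Item("them", 1, 50), Item("three", 2, 70), Item("the", 1, 900)]
ordered = [str(item) for item in sorted(items)]
```

Because equality looks only at distance and count, two items with different terms can compare equal.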

property count: int

Frequency of suggestion in the dictionary (a measure of how common the word is) or Naive Bayes probability of the individual suggestion parts in lookup_compound().

Return type

int

classmethod create_with_probability(term, distance)[source]

Creates a SuggestItem with Naive Bayes probability as the count.

Return type

SuggestItem

property distance: int

Edit distance between searched for word and suggestion.

Return type

int

property term: str

The suggested correctly spelled word.

Return type

str

class symspellpy.composition.Composition(segmented_string: str = '', corrected_string: str = '', distance_sum: int = 0, log_prob_sum: float = 0)[source]

Used by word_segmentation().

segmented_string

The word segmented string.

Type

str

corrected_string

The spelling corrected string.

Type

str

distance_sum

The sum of edit distances between the input string and the corrected string.

Type

int

log_prob_sum

The sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).

Type

float

classmethod create(composition, segmented_part, corrected_part, distance, log_prob)[source]

Creates a Composition by appending to an existing Composition.

Return type

Composition
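Appending to a composition amounts to extending both strings and adding to both sums. The sketch below uses a hypothetical Comp tuple and extend function. It assumes the parts are concatenated exactly as given, so the caller supplies any separating space; the library may handle separators differently.

```python
from typing import NamedTuple


class Comp(NamedTuple):  # hypothetical stand-in for Composition
    segmented_string: str = ""
    corrected_string: str = ""
    distance_sum: int = 0
    log_prob_sum: float = 0.0


def extend(comp, segmented_part, corrected_part, distance, log_prob):
    # Assumption: parts are appended verbatim; the caller adds the space.
    return Comp(
        comp.segmented_string + segmented_part,
        comp.corrected_string + corrected_part,
        comp.distance_sum + distance,
        comp.log_prob_sum + log_prob,
    )


first = extend(Comp(), "teh", "the", 1, -1.5)
second = extend(first, " quick", " quick", 0, -3.0)
```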

Utility class

class symspellpy.pickle_mixin.PickleMixin[source]

Implements saving and loading pickle functionality for SymSpell.

_load_pickle_stream(stream, from_bytes=False)[source]

Loads delete combination from stream as pickle. This will reduce the loading time compared to running load_dictionary() again.

NOTE: Prints a warning if the current settings count_threshold, max_dictionary_edit_distance, and prefix_length differ from the loaded settings. The current settings are then overwritten with the loaded settings.

Parameters
  • stream (Union[bytes, IO[bytes]]) – The stream from which the pickle data is loaded.

  • from_bytes (bool) – Flag to determine if we are loading from bytes or file.

Return type

bool

Returns

True if delete combinations are successfully loaded.

_save_pickle_stream(stream=None, to_bytes=False)[source]

Pickles _below_threshold_words, _bigrams, _deletes, _words, and _max_length into a stream for quicker loading later.

Pickles _count_threshold, _max_dictionary_edit_distance, and _prefix_length to ensure consistent behavior.

Parameters
  • stream (Optional[IO[bytes]]) – The stream to store the pickle data.

  • to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.

Return type

Optional[bytes]

Returns

A byte string of the pickled data if to_bytes=True.

load_pickle(data, compressed=True, from_bytes=False)[source]

Loads delete combination from file as pickle. This will reduce the loading time compared to running load_dictionary() again.

Parameters
  • data (Union[bytes, Path]) – Either bytes string to be used with from_bytes=True or the path+filename of the pickle file to be used with from_bytes=False.

  • compressed (bool) – A flag to determine whether to read the pickled data as compressed data.

  • from_bytes (bool) – Flag to determine if we are loading from bytes or file.

Return type

bool

Returns

True if delete combinations are successfully loaded.

save_pickle(filename=None, compressed=True, to_bytes=False)[source]

Pickles _deletes, _words, and _max_length into a stream for quicker loading later.

Parameters
  • filename (Optional[Path]) – The path+filename of the pickle file.

  • compressed (bool) – A flag to determine whether to compress the pickled data.

  • to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.

Return type

Optional[bytes]

Returns

A byte string of the pickled data if to_bytes=True.
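The save/load round trip with the compressed and bytes options can be sketched using only the standard library. This is an illustration under assumptions: to_bytes and from_bytes are hypothetical helpers, gzip is assumed as the compression format, and the state layout is invented for the example.

```python
import gzip
import pickle


def to_bytes(state, compressed=True):
    """Serialize precomputed data, optionally gzip-compressed (assumed format)."""
    raw = pickle.dumps(state)
    return gzip.compress(raw) if compressed else raw


def from_bytes(data, current_settings, compressed=True):
    """Restore the state and list the settings that differ from the current ones."""
    raw = gzip.decompress(data) if compressed else data
    state = pickle.loads(raw)
    changed = sorted(
        key
        for key, value in state["settings"].items()
        if current_settings.get(key) != value
    )
    return state, changed


state = {
    "settings": {
        "count_threshold": 1,
        "max_dictionary_edit_distance": 2,
        "prefix_length": 7,
    },
    "words": {"hello": 10},
    "deletes": {"hllo": ["hello"]},
}
blob = to_bytes(state)
current = {"count_threshold": 1, "max_dictionary_edit_distance": 1, "prefix_length": 7}
restored, changed = from_bytes(blob, current)
```

Unpickling can execute arbitrary code, so pickled data should only be loaded from trusted sources.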

SymSpell

class symspellpy.symspellpy.SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)[source]

Symmetric Delete spelling correction algorithm.

initial_capacity from the original code is omitted since Python cannot preallocate memory. compact_mask from the original code is omitted since suggested corrections are not mapped to hash codes.

Parameters
  • max_dictionary_edit_distance (int) – Maximum edit distance for doing lookups.

  • prefix_length (int) – The length of word prefixes used for spell checking.

  • count_threshold (int) – The minimum frequency count for dictionary words to be considered correct spellings.

_max_dictionary_edit_distance

Maximum edit distance for doing lookups.

Type

int

_prefix_length

The length of word prefixes used for spell checking.

Type

int

_count_threshold

The minimum frequency count a term must reach in the corpus to be considered a valid word for spelling correction.

Type

int

_distance_algorithm

Edit distance algorithms.

Type

DistanceAlgorithm

_max_length

Length of longest word in the dictionary.

Type

int

Raises
  • ValueError – If max_dictionary_edit_distance is negative.

  • ValueError – If prefix_length is less than 1 or not greater than max_dictionary_edit_distance.

  • ValueError – If count_threshold is negative.

_delete_in_suggestion_prefix(delete, delete_len, suggestion, suggestion_len)[source]

Checks whether all delete chars are present in the suggestion prefix in correct order, otherwise this is just a hash collision.

NOTE: No longer used in the Python port.

Return type

bool

_edits(word, edit_distance, delete_words, current_distance=0)[source]

Inexpensive and language independent: generates only deletes, no transposes, replaces, or inserts. Replaces and inserts are expensive and language dependent.

Return type

Set[str]
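Delete-only edit generation can be sketched as a short recursion. deletes_within is a hypothetical standalone function that follows the description above. It is not the library's _edits method, whose signature and bookkeeping differ.

```python
def deletes_within(word, max_distance):
    """Return every string reachable from word by 1..max_distance deletions."""
    found = set()

    def recurse(current, depth):
        # Stop at the distance limit; single characters are not shortened further.
        if depth == max_distance or len(current) <= 1:
            return
        for i in range(len(current)):
            shorter = current[:i] + current[i + 1:]
            if shorter not in found:
                found.add(shorter)
                recurse(shorter, depth + 1)

    recurse(word, 0)
    return found


one_delete = deletes_within("abc", 1)
two_deletes = deletes_within("abc", 2)
```

Every string produced at depth d has exactly d fewer characters than the input, so skipping strings that were already found never hides a deeper delete.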

_load_bigram_dictionary_stream(corpus_stream, term_index, count_index, separator=None)[source]

Loads multiple dictionary entries from a stream of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters
  • corpus_stream (IO[str]) – A file object of the dictionary.

  • term_index (int) – The column position of the word.

  • count_index (int) – The column position of the frequency count.

  • separator (Optional[str]) – Separator characters between term(s) and count.

Returns

True after file object is loaded.

_load_dictionary_stream(corpus_stream, term_index, count_index, separator=' ')[source]

Loads multiple dictionary entries from a stream of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters
  • corpus_stream (IO[str]) – A file object of the dictionary.

  • term_index (int) – The column position of the word.

  • count_index (int) – The column position of the frequency count.

  • separator (str) – Separator characters between term(s) and count.

Return type

bool

Returns

True after file object is loaded.
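How term_index, count_index, and separator select columns from each line can be sketched with an in-memory stream. load_counts is a hypothetical helper. Skipping malformed lines and summing the counts of repeated terms are assumptions made for this example.

```python
import io


def load_counts(stream, term_index, count_index, separator=" ", into=None):
    """Merge word/count pairs from a text stream into a dict."""
    counts = {} if into is None else into
    for line in stream:
        parts = line.rstrip().split(separator)
        if len(parts) <= max(term_index, count_index):
            continue  # skip lines that lack one of the columns
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # skip lines whose count is not an integer
        term = parts[term_index]
        # Assumption: repeated terms accumulate their counts.
        counts[term] = counts.get(term, 0) + count
    return counts


corpus = io.StringIO("the 900\nthem 50\nbroken\nthe 100\n")
counts = load_counts(corpus, term_index=0, count_index=1)
```

Passing an existing dict as into shows the merging behavior: new entries are added to the data already loaded.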

static _parse_words(text)[source]

Creates a non-unique wordlist from sample text. Language independent (e.g. works with Chinese characters).

Return type

List[str]
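A language-independent tokenizer of this kind can be sketched with a Unicode-aware regular expression. The pattern below is an assumption chosen for illustration, and parse_words is a hypothetical function; the library's actual pattern may differ.

```python
import re


def parse_words(text):
    # [^\W_] matches Unicode letters and digits but not the underscore;
    # an apostrophe is allowed inside a word.
    return re.findall(r"[^\W_]+(?:['’][^\W_]+)*", text.lower())


english = parse_words("The cat's hat, the CAT!")
chinese = parse_words("你好 世界")
```

The result keeps duplicates, which is what allows word frequencies to be counted from it.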

property below_threshold_words: Dict[str, int]

Dictionary of unique words that are below the count threshold for being considered correct spellings.

Return type

Dict[str, int]

property bigrams: Dict[str, int]

Dictionary of unique correct spelling bigrams, and the frequency count for each bigram.

Return type

Dict[str, int]

create_dictionary(corpus, encoding=None)[source]

Loads multiple dictionary words from a file containing plain text.

NOTE: Merges with any dictionary data already loaded.

Parameters
  • corpus (Union[Path, str, IO[str]]) – The path+filename of the file or a file object of the dictionary.

  • encoding (Optional[str]) – Text encoding of the corpus file.

Return type

bool

Returns

True if file loaded, or False if file not found.

create_dictionary_entry(key, count)[source]

Creates/updates an entry in the dictionary.

For every word there are deletes with an edit distance of 1..max_edit_distance created and added to the dictionary. Every delete entry has a suggestions list, which points to the original term(s) it was created from. The dictionary may be dynamically updated (word frequency and new words) at any time by calling create_dictionary_entry.

Parameters
  • key (str) – The word to add to dictionary.

  • count (int) – The frequency count for word.

Return type

bool

Returns

True if the word was added as a new correctly spelled word, or False if the word is added as a below threshold word, or updates an existing correctly spelled word.
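The return value depends on how the accumulated count compares with the count threshold. The sketch below models only that bookkeeping with two plain dicts. add_entry is a hypothetical function, and generation of the delete entries is left out.

```python
def add_entry(words, below_threshold, key, count, count_threshold=1):
    """Return True only when key becomes a new correctly spelled word."""
    if key in below_threshold:
        # Accumulate with the earlier occurrences of a pending word.
        count += below_threshold[key]
        if count < count_threshold:
            below_threshold[key] = count
            return False
        del below_threshold[key]  # threshold reached: promote the word
    elif key in words:
        words[key] += count  # update an existing correctly spelled word
        return False
    elif count < count_threshold:
        below_threshold[key] = count
        return False
    words[key] = count
    return True


words, pending = {}, {}
results = [
    add_entry(words, pending, "rare", 2, count_threshold=5),
    add_entry(words, pending, "rare", 3, count_threshold=5),
    add_entry(words, pending, "rare", 1, count_threshold=5),
]
```

The three calls store the word as pending, promote it once the accumulated count reaches the threshold, and then update the existing entry.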

delete_dictionary_entry(key)[source]

Deletes an entry in the dictionary.

If the deleted entry is the longest word, update _max_length with the next longest word.

Parameters

key (str) – The word to delete from the dictionary.

Return type

bool

Returns

True if the word is successfully deleted, or False if the word is not found.

property deletes: Dict[str, List[str]]

Dictionary that contains a mapping of lists of suggested correction words to the original words and the deletes derived from them. A list of suggestions might have a single suggestion, or multiple suggestions.

Return type

Dict[str, List[str]]

property distance_algorithm: symspellpy.editdistance.DistanceAlgorithm

The current distance algorithm.

Return type

DistanceAlgorithm

property entry_count: int

Number of unique correct spelling words.

Return type

int

load_bigram_dictionary(corpus, term_index, count_index, separator=None, encoding=None)[source]

Loads multiple dictionary entries from a file of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters
  • corpus (Union[Path, str]) – The path+filename of the file.

  • term_index (int) – The column position of the word.

  • count_index (int) – The column position of the frequency count.

  • separator (Optional[str]) – Separator characters between term(s) and count.

  • encoding (Optional[str]) – Text encoding of the dictionary file.

Return type

bool

Returns

True if file loaded, or False if file not found.

load_dictionary(corpus, term_index, count_index, separator=' ', encoding=None)[source]

Loads multiple dictionary entries from a file of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters
  • corpus (Union[Path, str]) – The path+filename of the file.

  • term_index (int) – The column position of the word.

  • count_index (int) – The column position of the frequency count.

  • separator (str) – Separator characters between term(s) and count.

  • encoding (Optional[str]) – Text encoding of the dictionary file.

Returns

True if file loaded, or False if file not found.

lookup(phrase, verbosity, max_edit_distance=None, include_unknown=False, ignore_token=None, transfer_casing=False)[source]

Finds suggested spellings for a given phrase word.

Parameters
  • phrase (str) – The word being spell checked.

  • verbosity (Verbosity) – The value controlling the quantity/closeness of the returned suggestions.

  • max_edit_distance (Optional[int]) – The maximum edit distance between phrase and suggested words. Set to _max_dictionary_edit_distance by default.

  • include_unknown (bool) – A flag to determine whether to include phrase word in suggestions, if no words within edit distance found.

  • ignore_token (Optional[Pattern[str]]) – A regex pattern describing what words/phrases to ignore and leave unchanged.

  • transfer_casing (bool) – A flag to determine whether the casing — i.e., uppercase vs lowercase — should be carried over from phrase.

Return type

List[SuggestItem]

Returns

A list of SuggestItem objects representing suggested correct spellings for the phrase word, sorted by edit distance, and secondarily by count frequency.

Raises

ValueError – If max_edit_distance is greater than _max_dictionary_edit_distance.
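The idea behind a symmetric delete lookup is to generate deletes of the input, match them against deletes precomputed from the dictionary, and verify the candidates with a real edit distance. The standalone sketch below illustrates that idea. All names are hypothetical, plain Levenshtein distance is swapped in for the configurable distance algorithm, and the prefix optimization is omitted.

```python
def deletes(word, max_distance):
    """All strings obtained from word by 0..max_distance deletions."""
    result, frontier = {word}, {word}
    for _ in range(max_distance):
        frontier = {
            w[:i] + w[i + 1:]
            for w in frontier
            if len(w) > 1
            for i in range(len(w))
        }
        result |= frontier
    return result


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (ca != cb))
            )
        previous = current
    return previous[-1]


def build_index(words, max_distance):
    """Map every delete (and the word itself) to the words it came from."""
    index = {}
    for word in words:
        for d in deletes(word, max_distance):
            index.setdefault(d, set()).add(word)
    return index


def sketch_lookup(phrase, words, index, max_distance):
    candidates = set()
    for d in deletes(phrase, max_distance):
        candidates |= index.get(d, set())
    hits = []
    for word in candidates:
        distance = levenshtein(phrase, word)  # verify the candidate
        if distance <= max_distance:
            hits.append((word, distance, words[word]))
    # Sort by edit distance, then by frequency descending.
    return sorted(hits, key=lambda h: (h[1], -h[2]))


words = {"hello": 10, "help": 8, "hell": 3, "world": 5}
index = build_index(words, 2)
suggestions = sketch_lookup("helo", words, index, 2)
```

The verification step matters because sharing a delete does not guarantee that two words are within the distance limit.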

lookup_compound(phrase, max_edit_distance, ignore_non_words=False, transfer_casing=False, split_by_space=False, ignore_term_with_digits=False)[source]

lookup_compound supports compound aware automatic spelling correction of multi-word input strings with three cases:

  1. mistakenly inserted space into a correct word led to two incorrect terms

  2. mistakenly omitted space between two correct words led to one incorrect combined term

  3. multiple independent input terms with/without spelling errors

Find suggested spellings for a multi-word input string (supports word splitting/merging).

Parameters
  • phrase (str) – The string being spell checked.

  • max_edit_distance (int) – The maximum edit distance between input and suggested words.

  • ignore_non_words (bool) – A flag to determine whether numbers and acronyms are left alone during the spell checking process.

  • transfer_casing (bool) – A flag to determine whether the casing — i.e., uppercase vs lowercase — should be carried over from phrase.

  • split_by_space (bool) – Splits the phrase into words simply based on space.

  • ignore_term_with_digits (bool) – A flag to determine whether any term with digits is left alone during the spell checking process. Only works when ignore_non_words is also True.

Return type

List[SuggestItem]

Returns

A list of SuggestItem objects representing suggested correct spellings for phrase.
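The first two cases listed above, a stray space and a missing space, can be illustrated with exact dictionary matches. This is a simplified sketch with hypothetical helper names. The real method also corrects spelling errors and weighs edit distances and frequencies, which this example leaves out.

```python
def split_candidates(term, words):
    """Case 2: a missing space fused two dictionary words into one term."""
    return [
        (term[:i], term[i:])
        for i in range(1, len(term))
        if term[:i] in words and term[i:] in words
    ]


def merge_adjacent(tokens, words):
    """Case 1: a stray space split one dictionary word into two tokens."""
    merged, i = [], 0
    while i < len(tokens):
        joined = tokens[i] + tokens[i + 1] if i + 1 < len(tokens) else None
        both_known = joined is not None and tokens[i] in words and tokens[i + 1] in words
        if joined in words and not both_known:
            merged.append(joined)  # the pair forms a known word
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


words = {"in", "the", "together", "to", "get", "her"}
splits = split_candidates("inthe", words)
merged = merge_adjacent(["toge", "ther", "in", "the"], words)
```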

property replaced_words: Dict[str, symspellpy.suggest_item.SuggestItem]

Dictionary corrected/modified words.

Return type

Dict[str, SuggestItem]

property word_count: int

Number of unique correct spelling words.

Return type

int

word_segmentation(phrase, max_edit_distance=None, max_segmentation_word_length=None, ignore_token=None)[source]

word_segmentation divides a string into words by inserting missing spaces at the appropriate positions. Misspelled words are corrected and do not affect segmentation. Existing spaces are allowed and considered for optimum segmentation.

word_segmentation uses a novel approach without recursion (https://medium.com/@wolfgarbe/fast-word-segmentation-for-noisy-text-2c2c41f9e8da). While each string of length n can be segmented in 2^(n−1) possible compositions (https://en.wikipedia.org/wiki/Composition_(combinatorics)), word_segmentation has a linear runtime O(n) to find the optimum composition.

Finds suggested spellings for a multi-word input string (supports word splitting/merging).

Parameters
  • phrase (str) – The string being spell checked.

  • max_segmentation_word_length (Optional[int]) – The maximum word length that should be considered.

  • max_edit_distance (Optional[int]) – The maximum edit distance between input and corrected words (0=no correction/segmentation only).

  • ignore_token (Optional[Pattern]) – A regex pattern describing what words/phrases to ignore and leave unchanged.

Return type

Composition

Returns

The word segmented string, the word segmented and spelling corrected string, the edit distance sum between input string and corrected string, and the sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
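Choosing the most probable composition without recursion can be sketched with a simple dynamic program. This is not the library's algorithm: segment is a hypothetical function, it performs no spelling correction, and it scores words by log10 of their relative frequency. With a bounded word length it runs in O(n · max_word_length) time.

```python
import math


def segment(text, counts, max_word_length=None):
    """Return (segmented string, log10 probability sum) for text without spaces."""
    total = sum(counts.values())
    longest = max_word_length or max(map(len, counts))
    # best[i] holds the best (log_prob_sum, words) found for text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - longest), end):
            word = text[start:end]
            if word in counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log10(counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    score, words = best[-1]
    return " ".join(words), score


counts = {"the": 50, "quick": 20, "brown": 20, "fox": 10}
segmented, log_prob_sum = segment("thequickbrownfox", counts)
```

If no sequence of known words covers the input, the sketch returns an empty string with a score of negative infinity.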

property words: Dict[str, int]

Dictionary of unique correct spelling words, and the frequency count for each word.

Return type

Dict[str, int]