symspellpy¶

Enum class¶

class symspellpy.verbosity.Verbosity(value)[source]¶

Controls the closeness/quantity of returned spelling suggestions.

TOP¶: Top suggestion with the highest term frequency of the suggestions of smallest edit distance found.

CLOSEST¶: All suggestions of smallest edit distance found, suggestions ordered by term frequency.

ALL¶: All suggestions within maxEditDistance, suggestions ordered by edit distance, then by term frequency (slower, no early termination).
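The difference between the three modes can be sketched with plain tuples. The `Mode` enum, the `select` helper, and the sample candidates below are hypothetical and only mirror the descriptions above; this is not symspellpy's implementation.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for Verbosity, for illustration only."""
    TOP = 0
    CLOSEST = 1
    ALL = 2


def select(candidates, mode):
    """Filter (term, distance, count) tuples the way each mode is described."""
    if not candidates:
        return []
    # Order by edit distance ascending, then by term frequency descending.
    ordered = sorted(candidates, key=lambda c: (c[1], -c[2]))
    if mode is Mode.ALL:
        return ordered
    smallest = ordered[0][1]
    closest = [c for c in ordered if c[1] == smallest]
    return closest[:1] if mode is Mode.TOP else closest


candidates = [("member", 1, 500), ("members", 2, 300), ("ember", 1, 40)]
top = select(candidates, Mode.TOP)
closest = select(candidates, Mode.CLOSEST)
everything = select(candidates, Mode.ALL)
```

Here `top` holds only the most frequent candidate at distance 1, `closest` holds both distance-1 candidates, and `everything` also includes the distance-2 candidate.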

Data class¶

class symspellpy.suggest_item.SuggestItem(term, distance, count)[source]¶

Spelling suggestion returned from lookup().

Parameters

term (str) – The suggested word.
distance (int) – Edit distance from search word.
count (int) – Frequency of suggestion in dictionary or Naive Bayes probability of the individual suggestion parts.

__eq__(other)[source]¶

Return type: bool
Returns: True if both distance and frequency count are the same.

__lt__(other)[source]¶

Return type: bool
Returns: Order by distance ascending, then by frequency count descending.

__str__()[source]¶

Return type: str
Returns: Displays attributes as “term, distance, count”.
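The comparison and display rules described for __eq__, __lt__, and __str__ can be sketched with a small stand-in class. `Item` is hypothetical and written only from the descriptions above; it is not the library's SuggestItem.

```python
from functools import total_ordering


@total_ordering
class Item:
    """Hypothetical stand-in for SuggestItem, for illustration only."""

    def __init__(self, term, distance, count):
        self.term = term
        self.distance = distance
        self.count = count

    def __eq__(self, other):
        # Equal when both distance and count match; the term is not compared.
        return self.distance == other.distance and self.count == other.count

    def __lt__(self, other):
        # Smaller distance sorts first; on a tie, the larger count sorts first.
        if self.distance == other.distance:
            return self.count > other.count
        return self.distance < other.distance

    def __str__(self):
        return f"{self.term}, {self.distance}, {self.count}"


items = [Item("apple", 2, 90), Item("ample", 1, 10), Item("maple", 1, 70)]
ordered = [str(item) for item in sorted(items)]
```

Sorting places "maple" before "ample" because both have distance 1 and "maple" has the higher count; "apple" comes last with distance 2.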

property count: int¶

Frequency of suggestion in the dictionary (a measure of how common the word is) or Naive Bayes probability of the individual suggestion parts in lookup_compound().

Return type: int

classmethod create_with_probability(term, distance)[source]¶

Creates a SuggestItem with Naive Bayes probability as the count.

Return type: SuggestItem

property distance: int¶

Edit distance between searched for word and suggestion.

Return type: int

property term: str¶

The suggested correctly spelled word.

Return type: str

class symspellpy.composition.Composition(segmented_string: str = '', corrected_string: str = '', distance_sum: int = 0, log_prob_sum: float = 0)[source]¶

Used by word_segmentation().

segmented_string¶

The word segmented string.

Type: str

corrected_string¶

The spelling corrected string.

Type: str

distance_sum¶

The sum of edit distances between the input string and the corrected string.

Type: int

log_prob_sum¶

The sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).

Type: float

classmethod create(composition, segmented_part, corrected_part, distance, log_prob)[source]¶

Creates a Composition by appending to an existing Composition.

Return type: Composition
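A minimal sketch of appending to an existing composition, assuming the parts are joined with a single space and the two sums are accumulated by addition. `Parts` and `extend` are hypothetical names, not the library's code.

```python
from typing import NamedTuple


class Parts(NamedTuple):
    """Hypothetical stand-in for Composition, for illustration only."""
    segmented_string: str = ""
    corrected_string: str = ""
    distance_sum: int = 0
    log_prob_sum: float = 0.0


def extend(parts, segmented_part, corrected_part, distance, log_prob):
    """Append one word to an existing composition.

    Assumption: parts are joined with a single space and both sums
    are accumulated by addition.
    """
    return Parts(
        f"{parts.segmented_string} {segmented_part}",
        f"{parts.corrected_string} {corrected_part}",
        parts.distance_sum + distance,
        parts.log_prob_sum + log_prob,
    )


first = Parts("teh", "the", 1, -1.5)
second = extend(first, "quick", "quick", 0, -3.0)
```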

Utility class¶

class symspellpy.pickle_mixin.PickleMixin[source]¶

Implements saving and loading pickle functionality for SymSpell.

_load_pickle_stream(stream, from_bytes=False)[source]¶

Loads delete combinations from a stream as pickle. This will reduce the loading time compared to running load_dictionary() again.

NOTE: Prints a warning if the current settings count_threshold, max_dictionary_edit_distance, and prefix_length differ from the loaded settings, then overwrites the current settings with the loaded settings.

Parameters

stream (Union[bytes, IO[bytes]]) – The stream from which the pickle data is loaded.
from_bytes (bool) – Flag to determine if we are loading from bytes or file.

Return type

bool

Returns

True if delete combinations are successfully loaded.

_save_pickle_stream(stream=None, to_bytes=False)[source]¶

Pickles _below_threshold_words, _bigrams, _deletes, _words, and _max_length into a stream for quicker loading later.

Pickles _count_threshold, _max_dictionary_edit_distance, and _prefix_length to ensure consistent behavior.

Parameters

stream (Optional[IO[bytes]]) – The stream to store the pickle data.
to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.

Return type

Optional[bytes]

Returns

A byte string of the pickled data if to_bytes=True.

load_pickle(data, compressed=True, from_bytes=False)[source]¶

Loads delete combinations from a file as pickle. This will reduce the loading time compared to running load_dictionary() again.

Parameters

data (Union[bytes, Path]) – Either bytes string to be used with from_bytes=True or the path+filename of the pickle file to be used with from_bytes=False.
compressed (bool) – A flag to determine whether to read the pickled data as compressed data.
from_bytes (bool) – Flag to determine if we are loading from bytes or file.

Return type

bool

Returns

True if delete combinations are successfully loaded.

save_pickle(filename=None, compressed=True, to_bytes=False)[source]¶

Pickles _deletes, _words, and _max_length into a stream for quicker loading later.

Parameters

filename (Optional[Path]) – The path+filename of the pickle file.
compressed (bool) – A flag to determine whether to compress the pickled data.
to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.

Return type

Optional[bytes]

Returns

A byte string of the pickled data if to_bytes=True.
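The idea of storing the data together with the settings it was built with can be sketched using only the standard library. The state layout, the helper names, and the use of gzip for the compressed case are assumptions made for illustration; they are not taken from the library's source.

```python
import gzip
import pickle


def dump_state(state, compressed=True):
    """Serialize a dict to bytes, optionally gzip-compressed."""
    raw = pickle.dumps(state)
    return gzip.compress(raw) if compressed else raw


def load_state(data, compressed=True):
    """Inverse of dump_state."""
    raw = gzip.decompress(data) if compressed else data
    return pickle.loads(raw)


# Hypothetical state layout: the data plus the settings it was built with.
state = {
    "deletes": {"helo": ["hello"], "hell": ["hello"]},
    "words": {"hello": 120},
    "max_length": 5,
    "settings": {
        "count_threshold": 1,
        "max_dictionary_edit_distance": 2,
        "prefix_length": 7,
    },
}
blob = dump_state(state)
restored = load_state(blob)
# A loader can compare stored and current settings and warn when they differ.
settings_match = restored["settings"] == state["settings"]
```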

SymSpell¶

class symspellpy.symspellpy.SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)[source]¶

Symmetric Delete spelling correction algorithm.

initial_capacity from the original code is omitted since Python cannot preallocate memory. compact_mask from the original code is omitted since we’re not mapping suggested corrections to hash codes.

Parameters

max_dictionary_edit_distance (int) – Maximum edit distance for doing lookups.
prefix_length (int) – The length of word prefixes used for spell checking.
count_threshold (int) – The minimum frequency count for dictionary words to be considered correct spellings.

_max_dictionary_edit_distance¶

Maximum edit distance for doing lookups.

Type: int

_prefix_length¶

The length of word prefixes used for spell checking.

Type: int

_count_threshold¶

The minimum frequency count a term must reach in the corpus to be considered a valid word for spelling correction.

Type: int

_distance_algorithm¶

Edit distance algorithms.

Type: DistanceAlgorithm

_max_length¶

Length of longest word in the dictionary.

Type: int

Raises

ValueError – If max_dictionary_edit_distance is negative.
ValueError – If prefix_length is less than 1 or not greater than max_dictionary_edit_distance.
ValueError – If count_threshold is negative.
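The three documented error conditions can be written out as a small validator. `check_settings` and `is_valid` are hypothetical helpers and the error messages are invented for this sketch; only the conditions come from the list above.

```python
def check_settings(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1):
    """Hypothetical validator mirroring the three documented ValueError cases."""
    if max_dictionary_edit_distance < 0:
        raise ValueError("max_dictionary_edit_distance cannot be negative")
    # "Not greater than" means prefix_length <= max_dictionary_edit_distance.
    if prefix_length < 1 or prefix_length <= max_dictionary_edit_distance:
        raise ValueError(
            "prefix_length must be at least 1 and greater than "
            "max_dictionary_edit_distance"
        )
    if count_threshold < 0:
        raise ValueError("count_threshold cannot be negative")


def is_valid(**settings):
    """Return True when the settings pass every check."""
    try:
        check_settings(**settings)
    except ValueError:
        return False
    return True


defaults_ok = is_valid()
short_prefix_ok = is_valid(prefix_length=2)  # equal to the default edit distance of 2
```

With the default max_dictionary_edit_distance of 2, a prefix_length of 2 is rejected because it is not strictly greater than the edit distance.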

_delete_in_suggestion_prefix(delete, delete_len, suggestion, suggestion_len)[source]¶

Checks whether all delete chars are present in the suggestion prefix in correct order, otherwise this is just a hash collision.

NOTE: No longer used in the Python port.

Return type: bool

_edits(word, edit_distance, delete_words, current_distance=0)[source]¶

Inexpensive and language independent: only deletes, no transposes, replaces, or inserts. Replaces and inserts are expensive and language dependent.

Return type: Set[str]
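A minimal sketch of delete-only edit generation. `delete_variants` is a hypothetical helper; it assumes that single-character strings are not shortened further, which may differ from how the library handles that edge case.

```python
def delete_variants(word, max_distance):
    """Return every string obtainable by deleting 1..max_distance characters."""
    variants = set()
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for current in frontier:
            if len(current) <= 1:
                continue  # assumption: do not shrink single characters
            for i in range(len(current)):
                shorter = current[:i] + current[i + 1:]
                if shorter not in variants:
                    variants.add(shorter)
                    next_frontier.add(shorter)
        frontier = next_frontier
    return variants


one_delete = delete_variants("cat", 1)
two_deletes = delete_variants("cat", 2)
```

No alphabet is needed to produce these variants, which is why the approach is language independent.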

_load_bigram_dictionary_stream(corpus_stream, term_index, count_index, separator=None)[source]¶

Loads multiple dictionary entries from a stream of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters

corpus_stream (IO[str]) – A file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (Optional[str]) – Separator characters between term(s) and count.

Returns

True after file object is loaded.

_load_dictionary_stream(corpus_stream, term_index, count_index, separator=' ')[source]¶

Loads multiple dictionary entries from a stream of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters

corpus_stream (IO[str]) – A file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (str) – Separator characters between term(s) and count.

Return type

bool

Returns

True after file object is loaded.
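Reading word/frequency pairs from a text stream can be sketched as follows. `read_counts` is a hypothetical helper; skipping malformed lines and summing the counts of repeated terms are assumptions of this sketch.

```python
import io


def read_counts(stream, term_index, count_index, separator=" "):
    """Parse lines of separated columns into a {term: count} dict."""
    counts = {}
    min_columns = max(term_index, count_index) + 1
    for line in stream:
        parts = line.rstrip().split(separator)
        if len(parts) < min_columns:
            continue  # skip lines with too few columns
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # skip lines whose count is not an integer
        term = parts[term_index]
        # Assumption: repeated terms accumulate rather than overwrite.
        counts[term] = counts.get(term, 0) + count
    return counts


sample = io.StringIO("the 23135851162\nof 13151942776\nbroken\nthe 5\n")
counts = read_counts(sample, term_index=0, count_index=1)
```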

static _parse_words(text)[source]¶

Creates a non-unique wordlist from sample text. Language independent (e.g. works with Chinese characters).

Return type: List[str]
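One way to get a language-independent word list is a Unicode-aware regular expression. The pattern below is an assumption chosen for this sketch (word characters without the underscore, plus apostrophes, applied to lowercased text) and is not claimed to be the library's exact pattern.

```python
import re

# Word characters except "_" plus straight and curly apostrophes.
WORD_PATTERN = re.compile(r"(?:[^\W_]|['’])+")


def parse_words(text):
    """Return every word in order of appearance, duplicates included."""
    return WORD_PATTERN.findall(text.lower())


tokens = parse_words("The quick_brown fox's 中文 test, THE END!")
```

Because `\w` matches letters from any script in Python 3 strings, the Chinese characters are kept as a token, and "the" appears twice since the list is not deduplicated.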

property below_threshold_words: Dict[str, int]¶

Dictionary of unique words that are below the count threshold for being considered correct spellings.

Return type: Dict[str, int]

property bigrams: Dict[str, int]¶

Dictionary of unique correct spelling bigrams, and the frequency count for each bigram.

Return type: Dict[str, int]

create_dictionary(corpus, encoding=None)[source]¶

Loads multiple dictionary words from a file containing plain text.

NOTE: Merges with any dictionary data already loaded.

Parameters

corpus (Union[Path, str, IO[str]]) – The path+filename of the file or a file object of the dictionary.
encoding (Optional[str]) – Text encoding of the corpus file.

Return type

bool

Returns

True if file loaded, or False if file not found.

create_dictionary_entry(key, count)[source]¶

Creates/updates an entry in the dictionary.

For every word there are deletes with an edit distance of 1..max_edit_distance created and added to the dictionary. Every delete entry has a suggestions list, which points to the original term(s) it was created from. The dictionary may be dynamically updated (word frequency and new words) at any time by calling create_dictionary_entry.

Parameters

key (str) – The word to add to dictionary.
count (int) – The frequency count for word.

Return type

bool

Returns

True if the word was added as a new correctly spelled word, or False if the word is added as a below threshold word, or updates an existing correctly spelled word.
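The return value rule can be sketched with two plain dictionaries. `add_entry` is a hypothetical helper that covers only the threshold bookkeeping; generating the deletes for a newly accepted word is omitted.

```python
def add_entry(words, below_threshold, key, count, count_threshold=1):
    """Return True only when key becomes a new correctly spelled word."""
    if key in words:
        words[key] += count  # update an existing correct word
        return False
    # Combine with any count collected while the word was below the threshold.
    total = below_threshold.pop(key, 0) + count
    if total < count_threshold:
        below_threshold[key] = total  # still too rare to count as correct
        return False
    words[key] = total  # promoted to a correct spelling
    return True


words, below = {}, {}
first = add_entry(words, below, "pear", 1, count_threshold=3)
second = add_entry(words, below, "pear", 2, count_threshold=3)
third = add_entry(words, below, "pear", 4, count_threshold=3)
```

Only the second call returns True: the first leaves "pear" below the threshold, and the third updates a word that is already accepted.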

delete_dictionary_entry(key)[source]¶

Deletes an entry in the dictionary.

If the deleted entry is the longest word, update _max_length with the next longest word.

Parameters: key (str) – The word to delete from the dictionary.
Return type: bool
Returns: True if the word is successfully deleted, or False if the word is not found.

property deletes: Dict[str, List[str]]¶

Dictionary that maps each original word, and each delete derived from it, to a list of suggested correction words. A list might hold a single suggestion or multiple suggestions.

Return type: Dict[str, List[str]]

property distance_algorithm: symspellpy.editdistance.DistanceAlgorithm¶

The current distance algorithm.

Return type: DistanceAlgorithm

property entry_count: int¶

Number of unique correct spelling words.

Return type: int

load_bigram_dictionary(corpus, term_index, count_index, separator=None, encoding=None)[source]¶

Loads multiple dictionary entries from a file of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters

corpus (Union[Path, str]) – The path+filename of the file.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (Optional[str]) – Separator characters between term(s) and count.
encoding (Optional[str]) – Text encoding of the dictionary file.

Return type

bool

Returns

True if file loaded, or False if file not found.

load_dictionary(corpus, term_index, count_index, separator=' ', encoding=None)[source]¶

Loads multiple dictionary entries from a file of word/frequency count pairs.

NOTE: Merges with any dictionary data already loaded.

Parameters

corpus (Union[Path, str]) – The path+filename of the file.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (str) – Separator characters between term(s) and count.
encoding (Optional[str]) – Text encoding of the dictionary file.

Returns

True if file loaded, or False if file not found.

lookup(phrase, verbosity, max_edit_distance=None, include_unknown=False, ignore_token=None, transfer_casing=False)[source]¶

Finds suggested spellings for a given phrase word.

Parameters

phrase (str) – The word being spell checked.
verbosity (Verbosity) – The value controlling the quantity/closeness of the returned suggestions.
max_edit_distance (Optional[int]) – The maximum edit distance between phrase and suggested words. Set to _max_dictionary_edit_distance by default.
include_unknown (bool) – A flag to determine whether to include the phrase word in the suggestions if no words within the edit distance are found.
ignore_token (Optional[Pattern[str]]) – A regex pattern describing what words/phrases to ignore and leave unchanged.
transfer_casing (bool) – A flag to determine whether the casing — i.e., uppercase vs lowercase — should be carried over from phrase.

Return type

List[SuggestItem]

Returns

A list of SuggestItem objects representing suggested correct spellings for the phrase word, sorted by edit distance, and secondarily by count frequency.

Raises

ValueError – If max_edit_distance is greater than _max_dictionary_edit_distance.
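The overall lookup idea, matching deletes of the input against precomputed deletes of the dictionary words and then ranking by distance and count, can be sketched as below. All helper names are hypothetical, plain Levenshtein distance is swapped in for whatever distance algorithm is configured, and none of the library's optimizations (prefix handling, early termination) are included.

```python
from collections import defaultdict


def deletes_of(word, max_distance):
    """The word itself plus all strings reachable by up to max_distance deletes."""
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))} - seen
        seen |= frontier
    return seen


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def build_index(words, max_distance):
    """Map every delete variant to the dictionary words it came from."""
    index = defaultdict(set)
    for word in words:
        for variant in deletes_of(word, max_distance):
            index[variant].add(word)
    return index


def lookup(phrase, words, index, max_distance):
    """Return (term, distance, count) tuples sorted by distance, then count."""
    candidates = set()
    for variant in deletes_of(phrase, max_distance):
        candidates |= index.get(variant, set())
    scored = [(w, levenshtein(phrase, w), words[w]) for w in candidates]
    scored = [s for s in scored if s[1] <= max_distance]
    return sorted(scored, key=lambda s: (s[1], -s[2]))


words = {"hello": 120, "help": 80, "hell": 30, "world": 95}
index = build_index(words, max_distance=2)
suggestions = lookup("helo", words, index, max_distance=2)
```

All three suggestions are at distance 1 from "helo", so they are ordered by count; "world" shares no delete variant with the input and is never scored.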

lookup_compound(phrase, max_edit_distance, ignore_non_words=False, transfer_casing=False, split_by_space=False, ignore_term_with_digits=False)[source]¶

lookup_compound supports compound aware automatic spelling correction of multi-word input strings with three cases:

mistakenly inserted space into a correct word led to two incorrect terms
mistakenly omitted space between two correct words led to one incorrect combined term
multiple independent input terms with/without spelling errors

Find suggested spellings for a multi-word input string (supports word splitting/merging).

Parameters

phrase (str) – The string being spell checked.
max_edit_distance (int) – The maximum edit distance between input and suggested words.
ignore_non_words (bool) – A flag to determine whether numbers and acronyms are left alone during the spell checking process.
transfer_casing (bool) – A flag to determine whether the casing — i.e., uppercase vs lowercase — should be carried over from phrase.
split_by_space (bool) – Splits the phrase into words simply based on space.
ignore_term_with_digits (bool) – A flag to determine whether any term with digits is left alone during the spell checking process. Only works when ignore_non_words is also True.

Return type

List[SuggestItem]

Returns

A list of SuggestItem objects representing suggested correct spellings for phrase.

property replaced_words: Dict[str, symspellpy.suggest_item.SuggestItem]¶

Dictionary of corrected/modified words.

Return type: Dict[str, SuggestItem]

property word_count: int¶

Number of unique correct spelling words.

Return type: int

word_segmentation(phrase, max_edit_distance=None, max_segmentation_word_length=None, ignore_token=None)[source]¶

word_segmentation divides a string into words by inserting missing spaces at the appropriate positions. Misspelled words are corrected and do not affect segmentation. Existing spaces are allowed and considered for optimum segmentation.

word_segmentation uses a novel approach without recursion (https://medium.com/@wolfgarbe/fast-word-segmentation-for-noisy-text-2c2c41f9e8da). While each string of length n can be segmented into 2^(n−1) possible compositions (https://en.wikipedia.org/wiki/Composition_(combinatorics)), word_segmentation has a linear runtime O(n) to find the optimum composition.

Finds suggested spellings for a multi-word input string (supports word splitting/merging).

Parameters

phrase (str) – The string being spell checked.
max_segmentation_word_length (Optional[int]) – The maximum word length that should be considered.
max_edit_distance (Optional[int]) – The maximum edit distance between input and corrected words (0=no correction/segmentation only).
ignore_token (Optional[Pattern]) – A regex pattern describing what words/phrases to ignore and leave unchanged.

Return type

Composition

Returns

The word segmented string, the word segmented and spelling corrected string, the edit distance sum between input string and corrected string, the sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
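Segmentation without recursion can be illustrated with a simple dynamic program over string positions. This sketch is a swapped-in textbook approach, not the library's algorithm: it does no spelling correction, `segment` and the sample counts are hypothetical, and the penalty for unknown single characters is an arbitrary choice. For a fixed maximum word length its running time grows linearly with the input length.

```python
import math


def segment(text, counts, max_word_length=None):
    """Split text into the most probable sequence of known words.

    Returns (segmented_string, log_prob_sum).
    """
    total = sum(counts.values())
    longest = max_word_length or max(len(w) for w in counts)
    # best[i] holds (log_prob_sum, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - longest), end):
            piece = text[start:end]
            if piece in counts:
                log_prob = math.log10(counts[piece] / total)
            elif len(piece) == 1:
                # Hypothetical penalty so that a segmentation always exists.
                log_prob = math.log10(1 / (10 * total))
            else:
                continue
            score = best[start][0] + log_prob
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    log_prob_sum, words = best[-1]
    return " ".join(words), log_prob_sum


counts = {"the": 500, "quick": 50, "brown": 40, "fox": 30, "he": 100, "row": 10}
segmented, log_prob_sum = segment("thequickbrownfox", counts)
```

The log probability sum plays the same role as log_prob_sum in Composition: a higher (less negative) value indicates a more common and more probable segmentation.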

property words: Dict[str, int]¶

Dictionary of unique correct spelling words, and the frequency count for each word.

Return type: Dict[str, int]