symspellpy
Enum class
- class symspellpy.verbosity.Verbosity(*values)
Controls the closeness/quantity of returned spelling suggestions.
- TOP
Top suggestion with the highest term frequency of the suggestions of smallest edit distance found.
- CLOSEST
All suggestions of smallest edit distance found, ordered by term frequency.
- ALL
All suggestions within max_edit_distance, ordered by edit distance, then by term frequency (slower, no early termination).
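As a rough illustration of how the three levels differ, the sketch below filters a hand-written candidate list. The Verbosity enum here is a local stand-in, and select() and the (term, distance, count) tuples are hypothetical; this is not how the library implements the filtering internally.

```python
from enum import Enum


class Verbosity(Enum):
    # Local stand-in for symspellpy.verbosity.Verbosity (illustration only).
    TOP = 0
    CLOSEST = 1
    ALL = 2


def select(candidates, verbosity):
    """Filter (term, distance, count) tuples according to the verbosity level.

    Hypothetical helper: it only mirrors the documented ordering rules.
    """
    if not candidates:
        return []
    # Order by edit distance ascending, then by frequency descending.
    ordered = sorted(candidates, key=lambda c: (c[1], -c[2]))
    if verbosity is Verbosity.ALL:
        return ordered
    smallest = ordered[0][1]
    closest = [c for c in ordered if c[1] == smallest]
    return closest[:1] if verbosity is Verbosity.TOP else closest


candidates = [("member", 1, 30), ("members", 1, 226), ("remember", 2, 1000)]
top = select(candidates, Verbosity.TOP)
closest = select(candidates, Verbosity.CLOSEST)
everything = select(candidates, Verbosity.ALL)
```

Note that a very frequent word at a larger edit distance ("remember" here) never outranks a closer one; frequency only breaks ties within the same distance.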
Data class
- class symspellpy.suggest_item.SuggestItem(term, distance, count)
Spelling suggestion returned from lookup().
- Parameters:
term (str) – The suggested word.
distance (int) – Edit distance from search word.
count (int) – Frequency of suggestion in dictionary or Naive Bayes probability of the individual suggestion parts.
- __eq__(other)
- Return type:
bool
- Returns:
True if both distance and frequency count are the same.
- __lt__(other)
- Return type:
bool
- Returns:
True if this item sorts before other, ordering by distance ascending, then by frequency count descending.
- property count: int
Frequency of suggestion in the dictionary (a measure of how common the word is) or Naive Bayes probability of the individual suggestion parts in lookup_compound().
- classmethod create_with_probability(term, distance)
Creates a SuggestItem with Naive Bayes probability as the count.
- Return type:
- property distance: int
Edit distance between the searched-for word and the suggestion.
- property term: str
The suggested correctly spelled word.
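The comparison rules described for __eq__ and __lt__ can be sketched with a small stand-in class. Item below is hypothetical and is not the library's SuggestItem; it only reproduces the documented ordering so that sorted() ranks suggestions the same way.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Item:
    # Hypothetical stand-in for SuggestItem (illustration only).
    term: str
    distance: int
    count: int

    def __eq__(self, other):
        # Equal when both distance and frequency count match.
        return self.distance == other.distance and self.count == other.count

    def __lt__(self, other):
        # Smaller distance first; for equal distance, larger count first.
        if self.distance == other.distance:
            return self.count > other.count
        return self.distance < other.distance


items = [Item("steam", 2, 50), Item("stem", 1, 10), Item("step", 1, 90)]
ranked = sorted(items)  # sorted() relies on __lt__
```

One consequence of the documented __eq__ is that two items with different terms compare equal when their distance and count coincide.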
- class symspellpy.composition.Composition(segmented_string: str = '', corrected_string: str = '', distance_sum: int = 0, log_prob_sum: float = 0)
Used by word_segmentation().
- segmented_string
The word segmented string.
- corrected_string
The spelling corrected string.
- distance_sum
The sum of edit distance between input string and corrected string.
- log_prob_sum
The sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
Utility class
- class symspellpy.pickle_mixin.PickleMixin
Implements saving and loading pickle functionality for SymSpell.
- _load_pickle_stream(stream, from_bytes=False)
Loads delete combinations from a stream as pickle. This reduces the loading time compared to running load_dictionary() again.
NOTE: Prints a warning if the current settings count_threshold, max_dictionary_edit_distance, and prefix_length differ from the loaded settings, then overwrites the current settings with the loaded settings.
- Parameters:
stream (Union[bytes, IO[bytes]]) – The stream from which the pickle data is loaded.
from_bytes (bool) – Flag to determine if we are loading from bytes or file.
- Return type:
bool
- Returns:
True if delete combinations are successfully loaded.
- _save_pickle_stream(stream=None, to_bytes=False)
Pickles _below_threshold_words, _bigrams, _deletes, _words, and _max_length into a stream for quicker loading later.
Pickles _count_threshold, _max_dictionary_edit_distance, and _prefix_length to ensure consistent behavior.
- Parameters:
stream (Optional[IO[bytes]]) – The stream to store the pickle data.
to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.
- Return type:
Optional[bytes]
- Returns:
A byte string of the pickled data if to_bytes=True.
- load_pickle(data, compressed=True, from_bytes=False)
Loads delete combinations from a file as pickle. This reduces the loading time compared to running load_dictionary() again.
- Parameters:
data (Union[bytes, Path]) – Either bytes string to be used with from_bytes=True or the path+filename of the pickle file to be used with from_bytes=False.
compressed (bool) – A flag to determine whether to read the pickled data as compressed data.
from_bytes (bool) – Flag to determine if we are loading from bytes or file.
- Return type:
bool
- Returns:
True if delete combinations are successfully loaded.
- save_pickle(filename=None, compressed=True, to_bytes=False)
Pickles _deletes, _words, and _max_length into a stream for quicker loading later.
- Parameters:
filename (Optional[Path]) – The path+filename of the pickle file.
compressed (bool) – A flag to determine whether to compress the pickled data.
to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.
- Return type:
Optional[bytes]
- Returns:
A byte string of the pickled data if to_bytes=True.
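The save/load round trip described above can be pictured with the standard library alone. The sketch below is an assumption-laden illustration, not the mixin's code: save_state(), load_state(), the state dictionary layout, and the choice of gzip for the compressed case are all hypothetical.

```python
import gzip
import pickle


def save_state(state, compressed=True):
    """Serialize a state dict to a bytes string (hypothetical helper)."""
    raw = pickle.dumps(state)
    return gzip.compress(raw) if compressed else raw


def load_state(data, compressed=True):
    """Restore a state dict from bytes produced by save_state()."""
    raw = gzip.decompress(data) if compressed else data
    return pickle.loads(raw)


# Hypothetical snapshot of precomputed data and the settings it was built with.
state = {
    "deletes": {"hel": ["hell", "help"]},
    "words": {"hell": 5, "help": 9},
    "max_length": 4,
    "settings": {"count_threshold": 1, "max_edit_distance": 2, "prefix_length": 7},
}
blob = save_state(state)
restored = load_state(blob)
```

Storing the settings next to the data is what makes a consistency check on load possible. As with any pickle data, only load bytes from a source you trust.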
SymSpell
- class symspellpy.symspellpy.SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1, distance_comparer=None)
Symmetric Delete spelling correction algorithm.
initial_capacity from the original code is omitted since Python cannot preallocate memory. compact_mask from the original code is omitted since we’re not mapping suggested corrections to hash codes.
- Parameters:
max_dictionary_edit_distance (int) – Maximum edit distance for doing lookups.
prefix_length (int) – The length of word prefixes used for spell checking.
count_threshold (int) – The minimum frequency count for dictionary words to be considered correct spellings.
- _max_dictionary_edit_distance
Maximum edit distance for doing lookups.
- Type:
int
- _prefix_length
The length of word prefixes used for spell checking.
- Type:
int
- _count_threshold
The minimum frequency count a term must reach in the corpus to be considered a valid word for spelling correction.
- Type:
int
- _distance_algorithm
Edit distance algorithms.
- Type:
- _max_length
Length of longest word in the dictionary.
- Type:
int
- Raises:
ValueError – If max_dictionary_edit_distance is negative.
ValueError – If prefix_length is less than 1 or not greater than max_dictionary_edit_distance.
ValueError – If count_threshold is negative.
- _delete_in_suggestion_prefix(delete, delete_len, suggestion, suggestion_len)
Checks whether all delete chars are present in the suggestion prefix in correct order, otherwise this is just a hash collision.
NOTE: No longer used in the Python port.
- Return type:
bool
- _edits(word, edit_distance, delete_words, current_distance=0)
Inexpensive and language independent: generates only deletes, with no transposes, replaces, or inserts. Replaces and inserts are expensive and language dependent.
- Return type:
set[str]
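To make the "only deletes" idea concrete, here is a minimal sketch that enumerates every string reachable by removing up to a given number of characters. deletes_within() is a hypothetical helper written iteratively for clarity; it ignores the prefix length handling and is not the library's _edits implementation.

```python
def deletes_within(word, max_distance):
    """Return all strings reachable from word by 1..max_distance deletions."""
    results = set()
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for current in frontier:
            if len(current) <= 1:
                continue  # avoid producing the empty string
            for i in range(len(current)):
                shorter = current[:i] + current[i + 1:]
                if shorter not in results:
                    results.add(shorter)
                    next_frontier.add(shorter)
        frontier = next_frontier
    return results


variants = deletes_within("the", 2)
```

Because only deletions are generated, the number of variants depends on the word length and the edit distance, not on the size of the alphabet, which is what keeps the approach language independent.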
- _load_bigram_dictionary_stream(corpus_stream, term_index, count_index, separator=None)
Loads multiple dictionary entries from a stream of word/frequency count pairs.
NOTE: Merges with any dictionary data already loaded.
- Parameters:
corpus_stream (IO[str]) – A file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (Optional[str]) – Separator characters between term(s) and count.
- Returns:
True after file object is loaded.
- _load_dictionary_stream(corpus_stream, term_index, count_index, separator=' ')
Loads multiple dictionary entries from a stream of word/frequency count pairs.
NOTE: Merges with any dictionary data already loaded.
- Parameters:
corpus_stream (IO[str]) – A file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (str) – Separator characters between term(s) and count.
- Return type:
bool
- Returns:
True after file object is loaded.
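The role of term_index, count_index, and separator is easiest to see on a tiny in-memory stream. The sketch below is a hypothetical reader, not the library's loader: read_term_counts() is an invented name, and skipping malformed lines and summing repeated terms are choices made for this illustration.

```python
import io


def read_term_counts(stream, term_index, count_index, separator=" "):
    """Parse term/count columns from a text stream into a dict (sketch)."""
    counts = {}
    for line in stream:
        parts = line.rstrip().split(separator)
        if len(parts) <= max(term_index, count_index):
            continue  # not enough columns on this line
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # count column is not an integer
        term = parts[term_index]
        # Merge with anything already seen for this term.
        counts[term] = counts.get(term, 0) + count
    return counts


corpus = io.StringIO("the 23135851162\nof 13151942776\nbad_line\nthe 1\n")
counts = read_term_counts(corpus, term_index=0, count_index=1)
```

With a frequency file laid out as "count<TAB>term", the same reader would be called with term_index=1, count_index=0, and separator="\t".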
- static _parse_words(text)
Creates a non-unique word list from sample text. Language independent (e.g. works with Chinese characters).
- Return type:
list[str]
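A language-independent tokenizer of this kind can be approximated with a Unicode-aware regular expression. The pattern below is an assumption chosen for this sketch and may differ from the one the library uses; parse_words() is a hypothetical name.

```python
import re

# Runs of letters/digits (any script), optionally joined by apostrophes.
# This pattern is an assumption for illustration, not the library's own.
WORD_PATTERN = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def parse_words(text):
    """Return lowercase words in order of appearance, duplicates kept."""
    return WORD_PATTERN.findall(text.lower())


english = parse_words("Hello, World! Don't stop. hello")
chinese = parse_words("你好 世界")
```

Duplicates are kept on purpose: when building a dictionary from plain text, each repeated occurrence contributes to the word's frequency count.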
- property below_threshold_words: dict[str, int]
Dictionary of unique words that are below the count threshold for being considered correct spellings.
- property bigrams: dict[str, int]
Dictionary of unique correct spelling bigrams, and the frequency count for each bigram.
- create_dictionary(corpus, encoding=None, errors=None)
Loads multiple dictionary words from a file containing plain text.
NOTE: Merges with any dictionary data already loaded.
- Parameters:
corpus (Union[Path, str, IO[str]]) – The path+filename of the file or a file object of the dictionary.
encoding (Optional[str]) – Text encoding of the corpus file. Default None.
errors (Optional[str]) – Determines how decoding errors are handled. Default None.
- Return type:
bool
- Returns:
True if file loaded, or False if file not found.
- create_dictionary_entry(key, count)
Creates/updates an entry in the dictionary.
For every word, deletes with an edit distance of 1..max_edit_distance are created and added to the dictionary. Every delete entry has a suggestions list, which points to the original term(s) it was created from. The dictionary may be dynamically updated (word frequency and new words) at any time by calling create_dictionary_entry.
- Parameters:
key (str) – The word to add to dictionary.
count (int) – The frequency count for word.
- Return type:
bool
- Returns:
True if the word was added as a new correctly spelled word, or False if the word is added as a below threshold word, or updates an existing correctly spelled word.
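The return value rules above can be traced with a small bookkeeping sketch. add_entry() and its arguments are hypothetical, and the generation of deletes is left out entirely; the sketch only follows how a word moves from the below-threshold pool to the set of correct spellings.

```python
def add_entry(words, below_threshold, key, count, count_threshold):
    """Return True only when key becomes a new correctly spelled word."""
    if key in words:
        words[key] += count  # update an existing correct word
        return False
    # Combine with any count accumulated while the word was too rare.
    total = below_threshold.pop(key, 0) + count
    if total < count_threshold:
        below_threshold[key] = total  # still below the threshold
        return False
    words[key] = total  # added (or promoted) as a correct spelling
    return True


words, below = {}, {}
first = add_entry(words, below, "cat", 1, count_threshold=3)
second = add_entry(words, below, "cat", 2, count_threshold=3)
third = add_entry(words, below, "cat", 5, count_threshold=3)
```

Here the first call leaves "cat" in the below-threshold pool, the second call promotes it once the accumulated count reaches the threshold, and the third call only updates its frequency.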
- delete_dictionary_entry(key)
Deletes an entry in the dictionary.
If the deleted entry is the longest word, updates _max_length with the next longest word.
- Parameters:
key (str) – The word to delete from the dictionary.
- Return type:
bool
- Returns:
True if the word is successfully deleted, or False if the word is not found.
- property deletes: dict[str, list[str]]
Dictionary that maps the original words, and the deletes derived from them, to lists of suggested correction words. A list of suggestions might have a single suggestion or multiple suggestions.
- property entry_count: int
Number of unique correct spelling words.
- load_bigram_dictionary(corpus, term_index, count_index, separator=None, encoding=None)
Loads multiple dictionary entries from a file of word/frequency count pairs.
NOTE: Frequency count should be an integer that fits within 64 bits.
NOTE: Merges with any dictionary data already loaded.
- Parameters:
corpus (Union[Path, str]) – The path+filename of the file.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (Optional[str]) – Separator characters between term(s) and count.
encoding (Optional[str]) – Text encoding of the dictionary file.
- Return type:
bool
- Returns:
True if file loaded, or False if file not found.
- load_dictionary(corpus, term_index, count_index, separator=' ', encoding=None)
Loads multiple dictionary entries from a file of word/frequency count pairs.
NOTE: Frequency count should be an integer that fits within 64 bits.
NOTE: Merges with any dictionary data already loaded.
- Parameters:
corpus (Union[Path, str, IO[str]]) – The path+filename of the file or a file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (str) – Separator characters between term(s) and count.
encoding (Optional[str]) – Text encoding of the dictionary file.
- Return type:
bool
- Returns:
True if file loaded, or False if file not found.
- lookup(phrase, verbosity, max_edit_distance=None, include_unknown=False, ignore_token=None, transfer_casing=False)
Finds suggested spellings for a given phrase word.
- Parameters:
phrase (str) – The word being spell checked.
verbosity (Verbosity) – The value controlling the quantity/closeness of the returned suggestions.
max_edit_distance (Optional[int]) – The maximum edit distance between phrase and suggested words. Set to _max_dictionary_edit_distance by default.
include_unknown (bool) – A flag to determine whether to include phrase word in suggestions, if no words within edit distance found.
ignore_token (Optional[Pattern[str]]) – A regex pattern describing what words/phrases to ignore and leave unchanged.
transfer_casing (bool) – A flag to determine whether the casing (i.e., uppercase vs lowercase) should be carried over from phrase.
- Return type:
list[SuggestItem]
- Returns:
A list of SuggestItem objects representing suggested correct spellings for the phrase word, sorted by edit distance, and secondarily by count frequency.
- Raises:
ValueError – If max_edit_distance is greater than _max_dictionary_edit_distance.
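The general symmetric delete idea behind a lookup can be sketched in a few functions: index every dictionary word under its deletes, generate the deletes of the input, and verify the overlapping candidates with a real edit distance. Everything below is a simplified assumption (no prefix optimization, no early termination, plain Levenshtein distance as a swapped-in metric, hypothetical function names) and should not be read as the library's implementation.

```python
from collections import defaultdict


def deletes(word, max_distance):
    """word itself plus every string with up to max_distance chars removed."""
    out = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out


def levenshtein(a, b):
    """Plain Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(word_counts, max_distance):
    """Map each delete variant to the dictionary words it came from."""
    index = defaultdict(set)
    for word in word_counts:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def lookup(phrase, index, word_counts, max_distance):
    """Return (term, distance, count) tuples, closest and most frequent first."""
    candidates = set()
    for variant in deletes(phrase, max_distance):
        candidates |= index.get(variant, set())
    scored = []
    for word in candidates:
        dist = levenshtein(phrase, word)
        if dist <= max_distance:
            scored.append((word, dist, word_counts[word]))
    return sorted(scored, key=lambda s: (s[1], -s[2]))


word_counts = {"hello": 100, "help": 80, "hell": 20, "world": 90}
index = build_index(word_counts, 2)
results = lookup("helo", index, word_counts, 2)
```

The candidate set is found purely through dictionary lookups on delete variants, so the comparatively expensive distance computation runs only on the few words that share a variant with the input.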
- lookup_compound(phrase, max_edit_distance, ignore_non_words=False, transfer_casing=False, split_by_space=False, ignore_term_with_digits=False)
lookup_compound supports compound aware automatic spelling correction of multi-word input strings with three cases:
1. mistakenly inserted space into a correct word led to two incorrect terms
2. mistakenly omitted space between two correct words led to one incorrect combined term
3. multiple independent input terms with/without spelling errors
Finds suggested spellings for a multi-word input string (supports word splitting/merging).
- Parameters:
phrase (str) – The string being spell checked.
max_edit_distance (int) – The maximum edit distance between input and suggested words.
ignore_non_words (bool) – A flag to determine whether numbers and acronyms are left alone during the spell checking process.
transfer_casing (bool) – A flag to determine whether the casing (i.e., uppercase vs lowercase) should be carried over from phrase.
split_by_space (bool) – Splits the phrase into words simply based on space.
ignore_term_with_digits – A flag to determine whether any term with digits is left alone during the spell checking process. Only works when ignore_non_words is also True.
- Return type:
list[SuggestItem]
- Returns:
A list of SuggestItem objects representing suggested correct spellings for phrase.
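The omitted-space case can be illustrated in isolation: try every split position of a token and keep a split whose two halves are both known words. This is only a sketch of that single case under stated assumptions; best_split() is hypothetical, it does no spelling correction of the halves, and scoring a split by the rarer of its two words is a choice made here, not the library's scoring.

```python
def best_split(token, word_counts):
    """Return (left, right) if token splits into two known words, else None."""
    best = None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in word_counts and right in word_counts:
            # Assumed scoring: prefer the split whose rarer half is most common.
            score = min(word_counts[left], word_counts[right])
            if best is None or score > best[0]:
                best = (score, left, right)
    return None if best is None else (best[1], best[2])


word_counts = {"where": 100, "is": 200, "he": 50}
split = best_split("whereis", word_counts)
unsplittable = best_split("xyz", word_counts)
```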
- property replaced_words: dict[str, SuggestItem]
Dictionary of corrected/modified words.
- property word_count: int
Number of unique correct spelling words.
- word_segmentation(phrase, max_edit_distance=None, max_segmentation_word_length=None, ignore_token=None)
word_segmentation divides a string into words by inserting missing spaces at the appropriate positions. Misspelled words are corrected and do not affect segmentation. Existing spaces are allowed and considered for optimum segmentation.
word_segmentation uses a novel approach without recursion (https://medium.com/@wolfgarbe/fast-word-segmentation-for-noisy-text-2c2c41f9e8da). While each string of length n can be segmented into 2^(n−1) possible compositions (https://en.wikipedia.org/wiki/Composition_(combinatorics)), word_segmentation has a linear runtime O(n) to find the optimum composition.
Finds suggested spellings for a multi-word input string (supports word splitting/merging).
- Parameters:
phrase (str) – The string being spell checked.
max_segmentation_word_length (Optional[int]) – The maximum word length that should be considered.
max_edit_distance (Optional[int]) – The maximum edit distance between input and corrected words (0=no correction/segmentation only).
ignore_token (Optional[Pattern[str]]) – A regex pattern describing what words/phrases to ignore and leave unchanged.
- Return type:
- Returns:
The word segmented string, the word segmented and spelling corrected string, the edit distance sum between input string and corrected string, and the sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
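To show why the optimum composition can be found without enumerating all 2^(n−1) candidates, here is a small dynamic programming sketch that scores segmentations by the sum of log10 word probabilities. It is a simplified stand-in under stated assumptions: segment() is a hypothetical function, it performs no spelling correction, and it is not the non-recursive algorithm from the linked article.

```python
import math


def segment(text, word_counts, max_word_length=None):
    """Return (log_prob_sum, words) for the most probable segmentation."""
    total = sum(word_counts.values())
    max_len = max_word_length or max(len(w) for w in word_counts)
    n = len(text)
    # best[i] holds (log_prob_sum, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in word_counts or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log10(word_counts[piece] / total)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n]


word_counts = {"the": 500, "quick": 50, "brown": 40, "fox": 30, "them": 20}
log_prob_sum, words = segment("thequickbrownfox", word_counts)
```

Each end position looks back at most max_len characters, so the work grows linearly with the input length for a fixed maximum word length. The returned log_prob_sum plays the same role as the log_prob_sum field described for Composition.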
- property words: dict[str, int]
Dictionary of unique correct spelling words, and the frequency count for each word.