symspellpy¶
Enum class¶
- class symspellpy.verbosity.Verbosity(value)[source]¶
Controls the closeness/quantity of returned spelling suggestions.
- TOP¶
Top suggestion with the highest term frequency of the suggestions of smallest edit distance found.
- CLOSEST¶
All suggestions of smallest edit distance found, suggestions ordered by term frequency.
- ALL¶
All suggestions within maxEditDistance, suggestions ordered by edit distance, then by term frequency (slower, no early termination).
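The three verbosity levels can be illustrated with a minimal, library-independent sketch. The `Verbosity` enum below is a local stand-in that mirrors the documented members, and `filter_by_verbosity` is a hypothetical helper, not part of symspellpy; it only shows which candidates each level would keep.

```python
from enum import Enum
from typing import List, Tuple


class Verbosity(Enum):
    # Local stand-in mirroring the documented members; not imported from symspellpy.
    TOP = 0
    CLOSEST = 1
    ALL = 2


Candidate = Tuple[str, int, int]  # (term, edit distance, frequency count)


def filter_by_verbosity(candidates: List[Candidate], verbosity: Verbosity) -> List[Candidate]:
    """Hypothetical helper: apply the documented selection rules to a candidate list."""
    # Order by edit distance ascending, then by frequency count descending.
    ordered = sorted(candidates, key=lambda c: (c[1], -c[2]))
    if not ordered or verbosity is Verbosity.ALL:
        return ordered
    smallest = ordered[0][1]
    closest = [c for c in ordered if c[1] == smallest]
    # TOP keeps only the most frequent of the closest candidates.
    return closest[:1] if verbosity is Verbosity.TOP else closest


candidates = [("member", 1, 500), ("members", 2, 900), ("ember", 1, 50)]
top = filter_by_verbosity(candidates, Verbosity.TOP)
closest = filter_by_verbosity(candidates, Verbosity.CLOSEST)
everything = filter_by_verbosity(candidates, Verbosity.ALL)
```

The sketch filters a finished candidate list; the early termination mentioned for TOP and CLOSEST is not modeled here.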
Data class¶
- class symspellpy.suggest_item.SuggestItem(term, distance, count)[source]¶
Spelling suggestion returned from lookup().
- Parameters
term (str) – The suggested word.
distance (int) – Edit distance from search word.
count (int) – Frequency of suggestion in dictionary or Naive Bayes probability of the individual suggestion parts.
- __eq__(other)[source]¶
- Return type
bool
- Returns
True if both distance and frequency count are the same.
- __lt__(other)[source]¶
- Return type
bool
- Returns
Order by distance ascending, then by frequency count descending.
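The comparison rules described for `__eq__` and `__lt__` can be sketched with a small stand-in class. `Suggestion` is a hypothetical name used only for illustration; it is not the library's `SuggestItem`, and it models nothing beyond the two documented comparisons.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Suggestion:
    """Hypothetical stand-in for SuggestItem with the documented comparison rules."""

    term: str
    distance: int
    count: int

    def __eq__(self, other):
        if not isinstance(other, Suggestion):
            return NotImplemented
        # Equal when both distance and frequency count match (term is ignored).
        return self.distance == other.distance and self.count == other.count

    def __lt__(self, other):
        if not isinstance(other, Suggestion):
            return NotImplemented
        # Distance ascending first, then frequency count descending.
        if self.distance == other.distance:
            return self.count > other.count
        return self.distance < other.distance


items = [Suggestion("a", 2, 100), Suggestion("b", 1, 10), Suggestion("c", 1, 99)]
ranked = [s.term for s in sorted(items)]
```

Because `sorted` relies on `__lt__`, the closest and most frequent suggestion ends up first.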
- property count: int¶
Frequency of suggestion in the dictionary (a measure of how common the word is) or Naive Bayes probability of the individual suggestion parts in lookup_compound().
- Return type
int
- classmethod create_with_probability(term, distance)[source]¶
Creates a SuggestItem with Naive Bayes probability as the count.
- Return type
SuggestItem
- property distance: int¶
Edit distance between searched for word and suggestion.
- Return type
int
- property term: str¶
The suggested correctly spelled word.
- Return type
str
- class symspellpy.composition.Composition(segmented_string: str = '', corrected_string: str = '', distance_sum: int = 0, log_prob_sum: float = 0)[source]¶
Used by word_segmentation().
- segmented_string¶
The word segmented string.
- Type
str
- corrected_string¶
The spelling corrected string.
- Type
str
- distance_sum¶
The sum of edit distances between input string and corrected string.
- Type
int
- log_prob_sum¶
The sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
- Type
float
Utility class¶
- class symspellpy.pickle_mixin.PickleMixin[source]¶
Implements saving and loading pickle functionality for SymSpell.
- _load_pickle_stream(stream, from_bytes=False)[source]¶
Loads delete combinations from a stream as pickle. This reduces the loading time compared to running load_dictionary() again.
NOTE: Prints a warning if the current settings count_threshold, max_dictionary_edit_distance, and prefix_length differ from the loaded settings, then overwrites the current settings with the loaded settings.
- Parameters
stream (Union[bytes, IO[bytes]]) – The stream from which the pickle data is loaded.
from_bytes (bool) – Flag to determine if we are loading from bytes or file.
- Return type
bool
- Returns
True if delete combinations are successfully loaded.
- _save_pickle_stream(stream=None, to_bytes=False)[source]¶
Pickles _below_threshold_words, _bigrams, _deletes, _words, and _max_length into a stream for quicker loading later.
Pickles _count_threshold, _max_dictionary_edit_distance, and _prefix_length to ensure consistent behavior.
- Parameters
stream (Optional[IO[bytes]]) – The stream to store the pickle data.
to_bytes – Flag to determine whether a bytes string should be returned instead of writing to file.
- Return type
Optional[bytes]
- Returns
A byte string of the pickled data if to_bytes=True.
- load_pickle(data, compressed=True, from_bytes=False)[source]¶
Loads delete combinations from a file as pickle. This reduces the loading time compared to running load_dictionary() again.
- Parameters
data (Union[bytes, Path]) – Either a bytes string to be used with from_bytes=True or the path+filename of the pickle file to be used with from_bytes=False.
compressed (bool) – A flag to determine whether to read the pickled data as compressed data.
from_bytes (bool) – Flag to determine if we are loading from bytes or file.
- Return type
bool
- Returns
True if delete combinations are successfully loaded.
- save_pickle(filename=None, compressed=True, to_bytes=False)[source]¶
Pickles _deletes, _words, and _max_length into a stream for quicker loading later.
- Parameters
filename (Optional[Path]) – The path+filename of the pickle file.
compressed (bool) – A flag to determine whether to compress the pickled data.
to_bytes (bool) – Flag to determine whether a bytes string should be returned instead of writing to file.
- Return type
Optional[bytes]
- Returns
A byte string of the pickled data if to_bytes=True.
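The save/load round trip described above can be sketched with the standard library alone. `dump_state` and `load_state` are hypothetical helpers, the state dictionary is made up for illustration, and the use of gzip for the `compressed` flag is an assumption of this sketch rather than a statement about the library's internals.

```python
import gzip
import pickle
from typing import Any, Dict


def dump_state(state: Dict[str, Any], compressed: bool = True) -> bytes:
    """Hypothetical helper: pickle a state dict, optionally gzip-compressing it."""
    raw = pickle.dumps(state)
    # Assumption: compression is modeled with gzip in this sketch.
    return gzip.compress(raw) if compressed else raw


def load_state(data: bytes, compressed: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: inverse of dump_state; flags must match those used to dump."""
    raw = gzip.decompress(data) if compressed else data
    return pickle.loads(raw)


# Illustrative state only; the real object stores more fields.
state = {"deletes": {"ct": ["cat"]}, "words": {"cat": 10}, "max_length": 3}
restored = load_state(dump_state(state))
restored_plain = load_state(dump_state(state, compressed=False), compressed=False)
```

As with `load_pickle` and `save_pickle`, the `compressed` flag has to agree between the writing and the reading side.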
SymSpell¶
- class symspellpy.symspellpy.SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)[source]¶
Symmetric Delete spelling correction algorithm.
initial_capacity from the original code is omitted since Python cannot preallocate memory. compact_mask from the original code is omitted since we’re not mapping suggested corrections to hash codes.
- Parameters
max_dictionary_edit_distance (int) – Maximum edit distance for doing lookups.
prefix_length (int) – The length of word prefixes used for spell checking.
count_threshold (int) – The minimum frequency count for dictionary words to be considered correct spellings.
- _max_dictionary_edit_distance¶
Maximum edit distance for doing lookups.
- Type
int
- _prefix_length¶
The length of word prefixes used for spell checking.
- Type
int
- _count_threshold¶
A threshold may be specified, when a term occurs so frequently in the corpus that it is considered a valid word for spelling correction.
- Type
int
- _distance_algorithm¶
Edit distance algorithms.
- Type
DistanceAlgorithm
- _max_length¶
Length of longest word in the dictionary.
- Type
int
- Raises
ValueError – If max_dictionary_edit_distance is negative.
ValueError – If prefix_length is less than 1 or not greater than max_dictionary_edit_distance.
ValueError – If count_threshold is negative.
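The three documented ValueError conditions can be written out as a short validation sketch. `validate_settings` and `is_valid` are hypothetical helpers and the error messages are invented; only the conditions themselves come from the Raises list above.

```python
def validate_settings(
    max_dictionary_edit_distance: int = 2,
    prefix_length: int = 7,
    count_threshold: int = 1,
) -> None:
    """Hypothetical helper: raise ValueError under the documented conditions."""
    if max_dictionary_edit_distance < 0:
        raise ValueError("max_dictionary_edit_distance cannot be negative")
    # prefix_length must be at least 1 and strictly greater than the edit distance.
    if prefix_length < 1 or prefix_length <= max_dictionary_edit_distance:
        raise ValueError("prefix_length must be >= 1 and > max_dictionary_edit_distance")
    if count_threshold < 0:
        raise ValueError("count_threshold cannot be negative")


def is_valid(**settings: int) -> bool:
    """Return True when the settings pass validation, False otherwise."""
    try:
        validate_settings(**settings)
    except ValueError:
        return False
    return True


defaults_ok = is_valid()
short_prefix_ok = is_valid(prefix_length=2)  # 2 is not greater than the default distance 2
```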
- _delete_in_suggestion_prefix(delete, delete_len, suggestion, suggestion_len)[source]¶
Checks whether all delete chars are present in the suggestion prefix in correct order, otherwise this is just a hash collision.
NOTE: No longer used in the Python port.
- Return type
bool
- _edits(word, edit_distance, delete_words, current_distance=0)[source]¶
Inexpensive and language independent: only deletes, no transposes, replaces, or inserts. Replaces and inserts are expensive and language dependent.
- Return type
Set[str]
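A minimal sketch of delete-only edit generation follows. `generate_deletes` is a hypothetical, iterative helper written for illustration; it does not reproduce the signature or the recursion of `_edits`, only the idea of removing one character at a time up to a maximum distance.

```python
from typing import Set


def generate_deletes(word: str, max_distance: int) -> Set[str]:
    """Hypothetical helper: strings obtained by deleting 1..max_distance characters."""
    deletes: Set[str] = set()
    frontier = {word}
    for _ in range(max_distance):
        next_frontier: Set[str] = set()
        for current in frontier:
            for i in range(len(current)):
                shorter = current[:i] + current[i + 1:]
                # Skip strings already produced to avoid repeated work.
                if shorter not in deletes:
                    deletes.add(shorter)
                    next_frontier.add(shorter)
        frontier = next_frontier
    return deletes


one_edit = generate_deletes("cat", 1)
two_edits = generate_deletes("cat", 2)
```

No alphabet is needed to produce these variants, which is why deletes are language independent while replaces and inserts are not.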
- _load_bigram_dictionary_stream(corpus_stream, term_index, count_index, separator=None)[source]¶
Loads multiple dictionary entries from a stream of word/frequency count pairs.
NOTE: Merges with any dictionary data already loaded.
- Parameters
corpus_stream (IO[str]) – A file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (Optional[str]) – Separator characters between term(s) and count.
- Returns
True after file object is loaded.
- _load_dictionary_stream(corpus_stream, term_index, count_index, separator=' ')[source]¶
Loads multiple dictionary entries from a stream of word/frequency count pairs.
NOTE: Merges with any dictionary data already loaded.
- Parameters
corpus_stream (IO[str]) – A file object of the dictionary.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (str) – Separator characters between term(s) and count.
- Return type
bool
- Returns
True after file object is loaded.
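How term_index, count_index, and separator pick columns out of each line can be sketched as below. `read_frequency_stream` is a hypothetical helper; skipping malformed lines and summing the counts of repeated terms are assumptions of this sketch, not documented behavior.

```python
import io
from typing import Dict, IO, Optional


def read_frequency_stream(
    stream: IO[str],
    term_index: int,
    count_index: int,
    separator: Optional[str] = " ",
) -> Dict[str, int]:
    """Hypothetical helper: read word/frequency pairs from a text stream."""
    counts: Dict[str, int] = {}
    for line in stream:
        parts = line.rstrip().split(separator)
        # Assumption: lines with too few columns are skipped.
        if len(parts) <= max(term_index, count_index):
            continue
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # Assumption: non-numeric counts are skipped.
        term = parts[term_index]
        # Assumption: repeated terms accumulate their counts.
        counts[term] = counts.get(term, 0) + count
    return counts


corpus = io.StringIO("the 23135851162\nof 13151942776\nmalformed\nthe 5\n")
counts = read_frequency_stream(corpus, term_index=0, count_index=1)
```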
- static _parse_words(text)[source]¶
Creates a non-unique wordlist from sample text. Language independent (e.g. works with Chinese characters).
- Return type
List[str]
- property below_threshold_words: Dict[str, int]¶
Dictionary of unique words that are below the count threshold for being considered correct spellings.
- Return type
Dict[str, int]
- property bigrams: Dict[str, int]¶
Dictionary of unique correct spelling bigrams, and the frequency count for each word.
- Return type
Dict[str, int]
- create_dictionary(corpus, encoding=None)[source]¶
Loads multiple dictionary words from a file containing plain text.
NOTE: Merges with any dictionary data already loaded.
- Parameters
corpus (Union[Path, str, IO[str]]) – The path+filename of the file or a file object of the dictionary.
encoding (Optional[str]) – Text encoding of the corpus file.
- Return type
bool
- Returns
True if file loaded, or False if file not found.
- create_dictionary_entry(key, count)[source]¶
Creates/updates an entry in the dictionary.
For every word there are deletes with an edit distance of 1..max_edit_distance created and added to the dictionary. Every delete entry has a suggestions list, which points to the original term(s) it was created from. The dictionary may be dynamically updated (word frequency and new words) at any time by calling create_dictionary_entry.
- Parameters
key (str) – The word to add to dictionary.
count (int) – The frequency count for word.
- Return type
bool
- Returns
True if the word was added as a new correctly spelled word, or False if the word is added as a below threshold word, or updates an existing correctly spelled word.
- delete_dictionary_entry(key)[source]¶
Deletes an entry in the dictionary.
If the deleted entry is the longest word, update _max_length with the next longest word.
- Parameters
key (str) – The word to delete from dictionary.
- Return type
bool
- Returns
True if the word is successfully deleted, or False if the word is not found.
- property deletes: Dict[str, List[str]]¶
Dictionary that contains a mapping of lists of suggested correction words to the original words and the deletes derived from them. A list of suggestions might have a single suggestion, or multiple suggestions.
- Return type
Dict[str, List[str]]
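The mapping from deletes to original words can be illustrated with a small symmetric-delete index. All names here (`deletes_of`, `build_index`, `candidates`) are hypothetical, and including each word itself as a key is an assumption of this sketch. The result is a candidate set only; a real lookup would still verify each candidate with an edit distance computation.

```python
from collections import defaultdict
from typing import Dict, List, Set


def deletes_of(word: str, max_distance: int) -> Set[str]:
    """The word itself plus every string reachable by deleting up to max_distance chars."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def build_index(words: List[str], max_distance: int) -> Dict[str, List[str]]:
    """Map each delete variant to the dictionary words it was derived from."""
    index: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        for key in deletes_of(word, max_distance):
            index[key].append(word)
    return dict(index)


def candidates(index: Dict[str, List[str]], typo: str, max_distance: int) -> Set[str]:
    """Collect dictionary words that share at least one delete variant with the input."""
    found: Set[str] = set()
    for key in deletes_of(typo, max_distance):
        found.update(index.get(key, []))
    return found


index = build_index(["hello", "help", "world"], max_distance=1)
matches = candidates(index, "helo", max_distance=1)
```

Here "helo" reaches "hello" because "helo" is itself a delete of "hello", and it reaches "help" because both share the delete "hel".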
- property distance_algorithm: symspellpy.editdistance.DistanceAlgorithm¶
The current distance algorithm.
- Return type
DistanceAlgorithm
- property entry_count: int¶
Number of unique correct spelling words.
- Return type
int
- load_bigram_dictionary(corpus, term_index, count_index, separator=None, encoding=None)[source]¶
Loads multiple dictionary entries from a file of word/frequency count pairs.
NOTE: Merges with any dictionary data already loaded.
- Parameters
corpus (Union[Path, str]) – The path+filename of the file.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (Optional[str]) – Separator characters between term(s) and count.
encoding (Optional[str]) – Text encoding of the dictionary file.
- Return type
bool
- Returns
True if file loaded, or False if file not found.
- load_dictionary(corpus, term_index, count_index, separator=' ', encoding=None)[source]¶
Loads multiple dictionary entries from a file of word/frequency count pairs.
NOTE: Merges with any dictionary data already loaded.
- Parameters
corpus (Union[Path, str]) – The path+filename of the file.
term_index (int) – The column position of the word.
count_index (int) – The column position of the frequency count.
separator (str) – Separator characters between term(s) and count.
encoding (Optional[str]) – Text encoding of the dictionary file.
- Returns
True if file loaded, or False if file not found.
- lookup(phrase, verbosity, max_edit_distance=None, include_unknown=False, ignore_token=None, transfer_casing=False)[source]¶
Finds suggested spellings for a given phrase word.
- Parameters
phrase (str) – The word being spell checked.
verbosity (Verbosity) – The value controlling the quantity/closeness of the returned suggestions.
max_edit_distance (Optional[int]) – The maximum edit distance between phrase and suggested words. Set to _max_dictionary_edit_distance by default.
include_unknown (bool) – A flag to determine whether to include phrase word in suggestions, if no words within edit distance found.
ignore_token (Optional[Pattern[str]]) – A regex pattern describing what words/phrases to ignore and leave unchanged.
transfer_casing (bool) – A flag to determine whether the casing (i.e., uppercase vs lowercase) should be carried over from phrase.
- Return type
List[SuggestItem]
- Returns
A list of SuggestItem objects representing suggested correct spellings for the phrase word, sorted by edit distance, and secondarily by count frequency.
- Raises
ValueError – If max_edit_distance is greater than _max_dictionary_edit_distance.
- lookup_compound(phrase, max_edit_distance, ignore_non_words=False, transfer_casing=False, split_by_space=False, ignore_term_with_digits=False)[source]¶
lookup_compound supports compound aware automatic spelling correction of multi-word input strings with three cases:
1. mistakenly inserted space into a correct word led to two incorrect terms
2. mistakenly omitted space between two correct words led to one incorrect combined term
3. multiple independent input terms with/without spelling errors
Find suggested spellings for a multi-word input string (supports word splitting/merging).
- Parameters
phrase (str) – The string being spell checked.
max_edit_distance (int) – The maximum edit distance between input and suggested words.
ignore_non_words (bool) – A flag to determine whether numbers and acronyms are left alone during the spell checking process.
transfer_casing (bool) – A flag to determine whether the casing (i.e., uppercase vs lowercase) should be carried over from phrase.
split_by_space (bool) – Splits the phrase into words simply based on space.
ignore_term_with_digits – A flag to determine whether any term with digits is left alone during the spell checking process. Only works when ignore_non_words is also True.
- Return type
List[SuggestItem]
- Returns
A list of SuggestItem objects representing suggested correct spellings for phrase.
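The second case listed for lookup_compound (an omitted space joining two correct words) can be sketched as a search over split points. `best_split` is a hypothetical helper and the frequency table is made up; ranking splits by the product of the two word counts is a simplified stand-in for a Naive Bayes style score and is an assumption of this sketch, not the library's method.

```python
from typing import Dict, Optional, Tuple


def best_split(term: str, frequencies: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: best way to split term into two known words, if any."""
    best: Optional[Tuple[str, str]] = None
    best_score = -1
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        if left in frequencies and right in frequencies:
            # Assumption: the product of counts approximates joint likelihood.
            score = frequencies[left] * frequencies[right]
            if score > best_score:
                best, best_score = (left, right), score
    return best


# Made-up counts for illustration only.
frequencies = {"where": 100, "is": 300, "whe": 1, "reis": 1}
split = best_split("whereis", frequencies)
no_split = best_split("xyz", frequencies)
```

Handling misspelled halves would additionally require a per-word lookup, which this sketch leaves out.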
- property replaced_words: Dict[str, symspellpy.suggest_item.SuggestItem]¶
Dictionary of corrected/modified words.
- Return type
Dict[str, SuggestItem]
- property word_count: int¶
Number of unique correct spelling words.
- Return type
int
- word_segmentation(phrase, max_edit_distance=None, max_segmentation_word_length=None, ignore_token=None)[source]¶
word_segmentation divides a string into words by inserting missing spaces at the appropriate positions. Misspelled words are corrected and do not affect segmentation. Existing spaces are allowed and considered for optimum segmentation.
word_segmentation uses a novel approach without recursion (https://medium.com/@wolfgarbe/fast-word-segmentation-for-noisy-text-2c2c41f9e8da). While each string of length n can be segmented into 2^(n−1) possible compositions (https://en.wikipedia.org/wiki/Composition_(combinatorics)), word_segmentation has a linear runtime O(n) to find the optimum composition.
Finds suggested spellings for a multi-word input string (supports word splitting/merging).
- Parameters
phrase (str) – The string being spell checked.
max_segmentation_word_length (Optional[int]) – The maximum word length that should be considered.
max_edit_distance (Optional[int]) – The maximum edit distance between input and corrected words (0=no correction/segmentation only).
ignore_token (Optional[Pattern]) – A regex pattern describing what words/phrases to ignore and leave unchanged.
- Return type
Composition
- Returns
The word segmented string, the word segmented and spelling corrected string, the edit distance sum between input string and corrected string, the sum of word occurrence probabilities in log scale (a measure of how common and probable the corrected segmentation is).
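To make the size of the search space concrete: a string of length n has n−1 gaps between characters, and each gap is either cut or not, which gives 2^(n−1) compositions. The brute-force enumeration below is a hypothetical illustration of that count (`compositions` is not a library function) and is precisely the exponential approach that a linear-time segmentation avoids.

```python
from itertools import product
from typing import List


def compositions(text: str) -> List[List[str]]:
    """Hypothetical helper: every way to cut a non-empty string into contiguous parts."""
    results: List[List[str]] = []
    # One boolean per gap between adjacent characters: cut there or not.
    for cuts in product([False, True], repeat=len(text) - 1):
        parts: List[str] = []
        start = 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                parts.append(text[start:position])
                start = position
        parts.append(text[start:])
        results.append(parts)
    return results


all_ways = compositions("abcd")
```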
- property words: Dict[str, int]¶
Dictionary of unique correct spelling words, and the frequency count for each word.
- Return type
Dict[str, int]