Statistical sentence suggestion model like spell checking
Question
There are already spell-checking models available which help us find suggested correct spellings based on a corpus of trained correct spellings. Can the granularity be increased from letters to words, so that we can get phrase suggestions as well, such that if an incorrect phrase is entered it suggests the nearest correct phrase from a corpus of correct phrases, assuming of course that it is trained from a list of valid phrases?
Are there any Python libraries that already achieve this functionality, or how should one proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?
Note: this is different from a spell checker, because the alphabet in a spell checker is finite, whereas in a phrase corrector the "alphabet" is itself a word and hence theoretically infinite; however, we can limit the number of words from a phrase bank.
Answer
What you want to build is an N-gram model, which consists in computing the probability for each word to follow a sequence of n words.
You can use the NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider a 2-gram (Markov model):
小猫"跟随可爱"的可能性是多少?
What is the probability for "kitten" to follow "cute"?
... or a 3-gram:
小猫"跟随可爱"的概率是多少?
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously, training the model with (n+1)-grams is costlier than with n-grams.
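To make the 2-gram vs. 3-gram distinction concrete, here is a minimal sketch (plain Python, whitespace tokenization, toy sentence of my own) that counts how often each word follows a context of n-1 words; the only difference between the two models is the context length:

```python
from collections import defaultdict

def ngram_counts(tokens, n):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # preceding n-1 words
        counts[context][tokens[i + n - 1]] += 1
    return counts

tokens = "the cute kitten sleeps and the cute puppy sleeps".split()

bigrams = ngram_counts(tokens, 2)   # context is one word
trigrams = ngram_counts(tokens, 3)  # context is two words

print(dict(bigrams[("cute",)]))        # {'kitten': 1, 'puppy': 1}
print(dict(trigrams[("the", "cute")]))  # {'kitten': 1, 'puppy': 1}
```

Note how the trigram table is sparser: each two-word context has been seen fewer times than any single word, which is why longer contexts need much more training data.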
Instead of considering words alone, you can consider the couple (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
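As a sketch of that idea, the bigram counting works unchanged if each token is a (word, tag) tuple. The hand-tagged list below stands in for the output of nltk.pos_tag on a toy sentence (the tags are my own Penn-Treebank-style assumption, not produced by NLTK here):

```python
from collections import defaultdict

# Hand-tagged toy sentence standing in for nltk.pos_tag(tokens) output
tagged = [("the", "DT"), ("cute", "JJ"), ("kitten", "NN"),
          ("jumps", "VBZ"), ("and", "CC"), ("the", "DT"),
          ("cute", "JJ"), ("puppy", "NN"), ("jumps", "VBZ")]

ngram = defaultdict(lambda: defaultdict(int))
for cur, nxt in zip(tagged, tagged[1:]):
    ngram[cur][nxt] += 1

# The (word, tag) couple disambiguates homographs, e.g. "jumps" the
# verb from "jumps" the noun, at the cost of a larger vocabulary.
print(dict(ngram[("cute", "JJ")]))  # {('kitten', 'NN'): 1, ('puppy', 'NN'): 1}
```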
You can also try to consider lemmas instead of words.
Here are some interesting lectures about N-gram modelling:
- Introduction to N-grams
- Estimating N-gram Probabilities
Here is a simple, unoptimized example of the code (2-gram):
from collections import defaultdict
import math
import nltk

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

# Count how often each lowercased token is followed by each next token.
# Note: map() must be materialized into a list in Python 3 before slicing.
for sentence in nltk.sent_tokenize(corpus):
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Convert counts to log10 probabilities: log P(next_token | token)
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}