Statistical sentence suggestion model like spell checking


Question

There are already spell-checking models available that help us find suggested correct spellings based on a corpus of trained correct spellings. Can the granularity be increased from letters to words, so that we can have phrase suggestions as well: if an incorrect phrase is entered, it should suggest the nearest correct phrase from a corpus of correct phrases, trained of course from a list of valid phrases.

Are there any Python libraries that already provide this functionality, or how should I proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?

Note: this is different from a spell checker, because the alphabet in a spell checker is finite, whereas in a phrase corrector each "letter" is itself a word, making the alphabet theoretically infinite; we can, however, limit the number of words with a phrase bank.

Answer

What you want to build is an N-gram model, which consists in computing the probability for each word to follow a sequence of n words.
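
Concretely, for a 2-gram model the maximum-likelihood estimate is P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}); the code example at the end of this answer computes exactly this quantity in log10 space.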

You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
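
For instance, a minimal tokenization sketch (assuming the punkt tokenizer data has been downloaded with nltk.download('punkt')):

import nltk

text = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))
# ['The', 'cat', 'is', 'cute', '.']
# ['He', 'jumps', 'and', 'he', 'is', 'happy', '.']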

You can consider a 2-gram model (a Markov model):

What is the probability for "kitten" to follow "cute"?

... or a 3-gram model:

What is the probability for "kitten" to follow "the cute"?

Obviously, training the model with (n+1)-grams is costlier than with n-grams.
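
To illustrate the difference (a sketch not from the original answer), the 3-gram counting step conditions on the previous two words instead of one, so the table of contexts grows accordingly:

from collections import defaultdict
import nltk

trigram = defaultdict(lambda: defaultdict(int))
tokens = [t.lower() for t in nltk.word_tokenize("The cat is cute.")]
# Key on the pair of preceding words rather than on a single word.
for prev2, prev1, nxt in zip(tokens, tokens[1:], tokens[2:]):
    trigram[(prev2, prev1)][nxt] += 1
# trigram[('the', 'cat')] == {'is': 1}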

Instead of considering single words, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags with nltk.pos_tag(tokens)).
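
A short sketch of what those (word, pos) pairs look like (assuming the tagger data has been downloaded with nltk.download('averaged_perceptron_tagger')):

import nltk

tokens = nltk.word_tokenize("The cat is cute.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('cute', 'JJ'), ('.', '.')]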

You can also try considering lemmas instead of words.
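
For example, with NLTK's WordNet lemmatizer (assuming the wordnet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("jumps", pos="v"))  # jump
print(lemmatizer.lemmatize("cats"))            # cat (the default pos is noun)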

Here are some interesting lectures about N-gram modelling:

  1. Introduction to N-grams
  2. Estimating N-gram Probabilities

This is a simple, short, and unoptimized code example (2-gram):

from collections import defaultdict
import math

import nltk

# Count bigram occurrences: ngram[token][next_token] is the number of
# times next_token immediately follows token in the corpus.
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # Use a list (not a map object) so it can be sliced for the zip below.
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
# Normalize the counts into log10 probabilities: log P(next_token | token).
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
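
Once trained, the model can be queried for the most probable continuations of a word, for instance (a hypothetical usage sketch, not part of the original answer):

# Suggest the most likely words to follow "is", best first.
suggestions = sorted(ngram["is"].items(), key=lambda kv: kv[1], reverse=True)
print(suggestions)  # [('cute', -0.301...), ('happy', -0.301...)]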
