Train a language model using Google Ngrams
Problem description
I want to find the conditional probability of a word given the set of words that precede it, and I plan to use Google N-grams for this. However, the corpus is so large that I don't think it is computationally feasible on my PC to process all the N-grams and train a language model.
So is there any way I can train a language model using Google Ngrams? (Even the Python NLTK library no longer supports n-gram language models.)

Note - I know that a language model can be trained from n-grams in general, but given the vast size of the Google N-grams corpus, how can a language model be trained using Google Ngrams specifically?
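For reference, the conditional probability being asked about is just a ratio of n-gram counts: the maximum-likelihood estimate P(w | context) = count(context + w) / count(context). Below is a minimal sketch of that computation over a tiny hand-made count table in a simplified "ngram TAB count" layout (the sample lines and numbers are made up for illustration; the real Google N-gram files also carry per-year and volume counts and are vastly larger):

```python
from collections import Counter

# Hypothetical slice of an n-gram count file, "w1 w2 w3<TAB>count" per line.
lines = [
    "analysis is often\t5",
    "analysis is rarely\t2",
    "analysis is never\t1",
]

trigram_counts = {}
prefix_counts = Counter()
for line in lines:
    ngram, count = line.rsplit("\t", 1)
    words = tuple(ngram.split())
    trigram_counts[words] = int(count)
    # Accumulate the count of the (n-1)-word prefix for the denominator.
    prefix_counts[words[:-1]] += int(count)

def cond_prob(word, context):
    """Maximum-likelihood estimate of P(word | context)."""
    prefix = tuple(context)
    if prefix_counts[prefix] == 0:
        return 0.0
    return trigram_counts.get(prefix + (word,), 0) / prefix_counts[prefix]

print(cond_prob("often", ["analysis", "is"]))  # 5 / (5 + 2 + 1) = 0.625
```

The practical difficulty the question raises is exactly that these count tables, at Google scale, do not fit in ordinary memory, which is what the compressed data structures in the answer below address.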
Recommended answer
You ought to check out this slick code base from UC Berkeley: https://github.com/adampauls/berkeleylm
In the examples/ folder, you will find a bash script make-binary-from-google.sh that creates a compact language model from the raw Google N-Grams. The resulting LM implements stupid backoff and utilizes a fast and efficient data structure described in the following paper: http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf
If you are just interested in the final trained LM, you can download it in a variety of languages from the Berkeley-hosted website: http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/