Train a language model using Google Ngrams


Question


I want to find the conditional probability of a word given its previous set of words. I plan to use the Google N-grams for this. However, the corpus is so huge that I don't think it is computationally feasible to process all the N-grams and train a language model on my PC.


So is there any way I can train a language model using the Google Ngrams? (Even the Python NLTK library no longer supports n-gram language models.) Note: I know that a language model can be trained from n-grams in general, but given the vast size of the Google N-grams, how can a language model be trained using the Google N-grams specifically?
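Concretely, what I am after is just the ratio of two n-gram counts. A minimal Python sketch of that estimate follows; the counts dict is a tiny stand-in for the real Google N-Grams counts, which are exactly the part that will not fit on my PC:

# Maximum-likelihood estimate: P(word | context) = count(context + word) / count(context).
def conditional_probability(counts, context, word):
    """MLE estimate of P(word | context) from raw n-gram counts."""
    joint = counts.get(context + (word,), 0)
    marginal = counts.get(context, 0)
    return joint / marginal if marginal else 0.0

# toy counts standing in for the real corpus
counts = {("new",): 10, ("new", "york"): 7}
print(conditional_probability(counts, ("new",), "york"))  # 0.7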

Answer


You ought to check out this slick code base from UC Berkeley: https://github.com/adampauls/berkeleylm


In the examples/ folder, you will find a bash script, make-binary-from-google.sh, that creates a compact language model from the raw Google N-Grams. The resulting LM implements stupid backoff and uses the fast, memory-efficient data structures described in the following paper: http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf
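If you just want to see what stupid backoff computes, here is a rough Python sketch of the scoring rule (Brants et al., 2007). The counts dict and the tiny example are purely illustrative; BerkeleyLM stores the counts in a far more compact structure, but the scoring logic is the same:

ALPHA = 0.4  # backoff discount suggested in the stupid backoff paper

def score(counts, total_unigrams, context, word):
    """Stupid-backoff score S(word | context) from raw n-gram counts."""
    if not context:
        # base case: unigram relative frequency
        return counts.get((word,), 0) / total_unigrams
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    # unseen n-gram: back off to a shorter context, discounted by ALPHA
    return ALPHA * score(counts, total_unigrams, context[1:], word)

# toy counts; the real Google N-Grams run to billions of entries
counts = {
    ("the",): 12, ("cat",): 5, ("sat",): 4,
    ("the", "cat"): 3, ("cat", "sat"): 2,
    ("the", "cat", "sat"): 2,
}
print(score(counts, total_unigrams=21, context=("the", "cat"), word="sat"))  # 2/3

Note that stupid backoff returns scores rather than normalized probabilities, which is part of why it scales so well to web-sized corpora.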


If you are just interested in the final trained LM, you can download it in a variety of languages from the Berkeley-hosted website: http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/
