How to detect language
Question
Are there any good open-source engines for detecting what language a text is written in, ideally with a probability score? I need one that I can run locally and that doesn't query Google or Bing. I'd like to detect the language of each page in about 15 million pages of OCR'ed text.
Not all documents will contain languages which use the Latin alphabet.
Recommended answer
Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.
In general, letter and word frequencies would probably be the fastest evaluation, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will probably also be useful if you discover that the two frequency-based methods (letter and word frequencies) have too high an error rate.
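To make the frequency idea concrete, here is a minimal sketch of a naive Bayes classifier over character trigram frequencies, using only the Python standard library. It is not NLTK's implementation; the class name `NgramLanguageDetector`, the smoothing constant `alpha`, the uniform prior over languages, and the tiny training sentences are all assumptions made for illustration. Because it works on Unicode characters rather than Latin letters, the same approach should carry over to non-Latin scripts, and the normalized scores give a rough probability per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Yield overlapping character n-grams from lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    for i in range(len(padded) - n + 1):
        yield padded[i:i + n]


class NgramLanguageDetector:
    """Hypothetical helper: naive Bayes over character n-gram counts."""

    def __init__(self, n=3, alpha=0.5):
        self.n = n
        self.alpha = alpha      # additive smoothing for unseen n-grams (assumed value)
        self.counts = {}        # language -> Counter of n-grams
        self.totals = {}        # language -> total number of n-grams seen
        self.vocab = set()      # all n-grams seen in any language

    def train(self, language, text):
        counter = self.counts.setdefault(language, Counter())
        counter.update(char_ngrams(text, self.n))
        self.totals[language] = sum(counter.values())
        self.vocab.update(counter)

    def probabilities(self, text):
        """Return {language: probability}, assuming a uniform prior over languages."""
        grams = list(char_ngrams(text, self.n))
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        log_scores = {}
        for language, counter in self.counts.items():
            denom = self.totals[language] + self.alpha * vocab_size
            log_scores[language] = sum(
                math.log((counter[g] + self.alpha) / denom) for g in grams
            )
        # Normalize in log space (subtract the max) to avoid underflow on long pages.
        top = max(log_scores.values())
        weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
        norm = sum(weights.values())
        return {lang: w / norm for lang, w in weights.items()}

    def detect(self, text):
        """Return the most probable language and its probability."""
        probs = self.probabilities(text)
        best = max(probs, key=probs.get)
        return best, probs[best]


# Toy training data (illustrative only; real use needs far larger samples).
detector = NgramLanguageDetector()
detector.train("english", "the quick brown fox jumps over the lazy dog "
                          "and the cat is on the table with the other animals")
detector.train("german", "der schnelle braune fuchs springt über den faulen hund "
                         "und die katze ist auf dem tisch mit den anderen tieren")
detector.train("russian", "быстрая коричневая лиса прыгает через ленивую собаку "
                          "и кошка сидит на столе с другими животными")

for page in ["the dog and the cat", "der hund und die katze", "собака и кошка"]:
    language, probability = detector.detect(page)
    print(f"{page!r}: {language} (p={probability:.3f})")
```

With realistic training corpora, the per-language tables are built once and each page is scored in a single pass, which matters at the scale of 15 million pages. The reported probability is only relative to the languages the detector was trained on, and OCR noise will lower confidence, so a threshold for flagging uncertain pages may be worth adding.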