How to detect language


Question

Are there any good, open source engines out there for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query Google or Bing? I'd like to detect language for each page in about 15 million pages of OCR'ed text.

Not all documents will contain languages which use the Latin alphabet.

Recommended answer

Depending on what you're doing, you might want to check out the Natural Language Toolkit (NLTK) for Python, which has some support for Bayesian learning algorithms.

In general, comparing letter and word frequencies is probably the fastest way to evaluate a page, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will probably also be useful if you find that the two frequency-based methods (letter and word frequencies) have too high an error rate.
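To make the frequency idea concrete, here is a minimal sketch that combines both suggestions: it counts character trigram frequencies per language and scores a page with a naive Bayes model, returning a probability for each language. It uses only the Python standard library rather than NLTK. The class name `NgramLanguageGuesser`, the helper `char_ngrams`, the trigram size, the uniform prior, and the add-one smoothing are all assumptions made for illustration and are not part of the original answer. Because it works on Unicode characters, it is not limited to the Latin alphabet.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageGuesser:
    """Hypothetical helper: multinomial naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.vocab = set()                  # every n-gram seen in training

    def train(self, language, text):
        """Add the n-gram counts of a sample text known to be in `language`."""
        grams = char_ngrams(text, self.n)
        self.counts[language].update(grams)
        self.vocab.update(grams)

    def probabilities(self, text):
        """Return {language: probability}, assuming a uniform prior over languages."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        log_scores = {}
        for language, counter in self.counts.items():
            total = sum(counter.values())
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            log_scores[language] = sum(
                math.log((counter[g] + 1) / (total + vocab_size)) for g in grams
            )
        # Normalise in log space so long pages do not underflow to zero.
        top = max(log_scores.values())
        weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
        norm = sum(weights.values())
        return {lang: w / norm for lang, w in weights.items()}


if __name__ == "__main__":
    guesser = NgramLanguageGuesser()
    # Toy training samples; a real run would need far larger corpora per language.
    guesser.train("en", "the quick brown fox jumps over the lazy dog and this is the house")
    guesser.train("de", "der schnelle braune fuchs springt über den faulen hund und das ist das haus")
    guesser.train("ru", "быстрая коричневая лиса прыгает через ленивую собаку и это дом людей")
    probs = guesser.probabilities("это собака")
    print(max(probs, key=probs.get), probs)
```

For 15 million pages, a sketch like this would be trained once on reference text for each expected language and then applied page by page. Very short or badly OCR'ed pages will give unreliable probabilities, so it is probably worth flagging pages whose top score is low instead of trusting the label.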
