Determine if text is in English?


Question

I am using both NLTK and scikit-learn to do some text processing. However, my list of documents includes some that are not in English. For example, the list might look like this:

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ] 

For the purposes of my analysis, I want to remove all sentences that are not in English as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether a string is in English. Is this functionality not offered by either NLTK or scikit-learn?

EDIT: I've seen questions like this and this, but both deal with individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so Python libraries would be preferable, but I can switch languages if needed. I just thought Python would be the best fit for this.

Recommended answer

There is a library called langdetect. It is a port of Google's language-detection library and is available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
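
As a rough illustration of how this could be applied to the list in the question, here is a minimal sketch. It assumes langdetect is installed (`pip install langdetect`) and uses its `detect` function, `DetectorFactory.seed` setting, and `LangDetectException`. The helper names `langdetect_detector` and `keep_english` are made up for this example and are not part of the library. Detection on very short strings can be unreliable, so treat the result as a heuristic rather than a guarantee.

```python
from typing import Callable, Iterable, List, Optional


def langdetect_detector(text: str) -> Optional[str]:
    """Return a language code such as 'en', or None if detection fails.

    Hypothetical wrapper around langdetect; the import is done lazily so
    the filtering logic below does not depend on the library directly.
    """
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; a fixed seed makes
    # repeated runs give the same answer for the same input.
    DetectorFactory.seed = 0
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty string).
        return None


def keep_english(
    docs: Iterable[str],
    detect: Callable[[str], Optional[str]] = langdetect_detector,
) -> List[str]:
    """Keep only the documents whose detected language code is 'en'."""
    return [doc for doc in docs if detect(doc) == "en"]
```

Calling `keep_english(docs)` on the list from the question would then return only the entries that langdetect labels as `en`. Each document is passed to the detector as a whole string, so there is no need to loop over individual words.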
