Detecting language using Stanford NLP


Question

I'm wondering whether Stanford CoreNLP can be used to detect which language a sentence is written in. If so, how precise can those algorithms be?

Recommended answer

Almost certainly, there is no language identification in Stanford CoreNLP at the moment. I say "almost" because nonexistence is much harder to prove.

Nevertheless, here is some circumstantial evidence:

  1. Language identification is not mentioned on the main page, on the CoreNLP page, in the FAQ (although the FAQ does include the question "How do I run CoreNLP on other languages?"), or in the 2014 paper by CoreNLP's authors.
  2. Tools that combine several NLP libraries, including Stanford CoreNLP, use a different library for language identification, for example DKPro Core ASL. Other users discussing language identification together with CoreNLP do not mention this capability either.
  3. The CoreNLP source contains Language classes, but nothing related to language identification. You can manually check all 84 occurrences of the word "language" in the source.

Try Tika, TextCat, or the Language Detection Library for Java (its authors report "99% over precision for 53 languages").
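To give a feel for how this kind of detector can work, here is a minimal Python sketch of the character n-gram rank-order method (Cavnar and Trenkle) that TextCat is based on. It is not the API of Tika, TextCat, or the Language Detection Library for Java. The function names, the `SAMPLES` sentences, and the `top_k` value are assumptions made for illustration, and the three one-sentence "training" texts are far too small for real use.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the text's character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    # Keep runs of Unicode letters only; digits and punctuation are dropped.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


# Hypothetical toy training data: one sentence per language (an assumption for this sketch).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs away through the woods",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft durch den wald",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et court dans la forêt",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language whose profile has the smallest out-of-place distance to the text."""
    doc = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc, PROFILES[lang]))


if __name__ == "__main__":
    print(detect_language("the dog runs through the woods"))
```

The libraries named above ship with profiles trained on much larger corpora, so for real work it is better to use one of them than to extend this sketch.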

In general, quality depends on the length of the input text: if it is long enough (say, at least several words, and not specially chosen), precision can be quite good, around 95%.
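Because very short inputs are the unreliable case, one possible precaution is to refuse to guess below a minimum length. The Python sketch below wraps any detector function in such a guard. The name `detect_if_long_enough` and the threshold of 20 letters are assumptions chosen for illustration, not values taken from any of the libraries mentioned.

```python
import re


def detect_if_long_enough(text, detect, min_letters=20):
    """Call detect(text) only if text has at least min_letters letters; otherwise return None.

    `detect` is any callable that maps a string to a language code.
    The default threshold is a hypothetical value, not a measured one.
    """
    letters = re.findall(r"[^\W\d_]", text)  # Unicode letters only
    if len(letters) < min_letters:
        return None
    return detect(text)
```

A caller can treat `None` as "unknown" and fall back to a default language or ask for more text.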
