Language detection for very short text

Views: 62 Published: 2020/5/18 0:45:20 Tags: nlp nltk language-detection

Problem description

I'm creating an application for detecting the language of short texts that average fewer than 100 characters and contain slang (e.g., tweets, user queries, SMS).

All the libraries I tested work well for normal web pages but not for very short text. The library that gives the best results so far is Chrome's Language Detection (CLD) library, which I had to build as a shared library.

CLD fails when the text is made of very short words. After looking at the source code of CLD, I see that it uses 4-grams, so that could be the reason.
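To illustrate why 4-grams could struggle here, the following is a minimal sketch of character 4-gram extraction. It is a simplified illustration, not CLD's actual implementation: the function names `char_ngrams` and `ngrams_for_text`, and the single `_` boundary marker on each side of a word, are assumptions made for this example.

```python
def char_ngrams(word, n=4, pad="_"):
    """Return the character n-grams of one word.

    The word is wrapped in a single boundary marker on each side
    (an assumption for this sketch, not CLD's real scheme).
    """
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngrams_for_text(text, n=4):
    """Collect the n-grams of every whitespace-separated word in the text."""
    grams = []
    for word in text.lower().split():
        grams.extend(char_ngrams(word, n))
    return grams


if __name__ == "__main__":
    # A long word yields several overlapping 4-grams ...
    print(char_ngrams("language"))
    # ... while a text made of two-letter words yields one 4-gram per word.
    print(ngrams_for_text("is it ok"))
```

Under this scheme a two-letter word produces a single 4-gram (the word itself plus its boundaries) and a one-letter word produces none, so a text made mostly of very short words gives an n-gram model little sub-word evidence to work with.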

The approach I'm thinking of right now to improve the accuracy is:

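Remove brand names, numbers, URLs, and words like "software", "download", "internet".
Use a dictionary when the number of short words in the text exceeds a threshold, or when the text contains too few words.
The dictionary is created from Wikipedia news articles and hunspell dictionaries.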
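The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny `WORDLISTS` stand in for dictionaries built from Wikipedia and hunspell, the thresholds (`short_len`, `short_ratio`, `min_tokens`) are arbitrary placeholders, and `ngram_detector` is a hypothetical callable wrapping whatever n-gram detector is in use (for example a CLD binding).

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters only

# Language-neutral terms to drop; brand names would be added here too.
STOP_TERMS = {"software", "download", "internet"}

# Toy word lists (assumption): real ones would come from Wikipedia + hunspell.
WORDLISTS = {
    "en": {"the", "is", "it", "to", "you", "are", "how"},
    "fr": {"le", "la", "est", "tu", "je", "et", "oui"},
}


def clean_tokens(text):
    """Lowercase, strip URLs and numbers, and drop the stop terms."""
    text = URL_RE.sub(" ", text.lower())
    text = NUMBER_RE.sub(" ", text)
    return [t for t in TOKEN_RE.findall(text) if t not in STOP_TERMS]


def needs_dictionary(tokens, short_len=3, short_ratio=0.5, min_tokens=4):
    """True when there are too few words or too many short ones."""
    if len(tokens) < min_tokens:
        return True
    short = sum(1 for t in tokens if len(t) <= short_len)
    return short / len(tokens) > short_ratio


def dictionary_vote(tokens, wordlists=WORDLISTS):
    """Return the language whose word list matches the most tokens, or None."""
    counts = Counter({lang: sum(t in words for t in tokens)
                      for lang, words in wordlists.items()})
    if not counts:
        return None
    lang, hits = counts.most_common(1)[0]
    return lang if hits > 0 else None


def detect(text, ngram_detector=None):
    """Use the dictionary for short-word text, else the n-gram detector."""
    tokens = clean_tokens(text)
    if needs_dictionary(tokens):
        guess = dictionary_vote(tokens)
        if guess is not None:
            return guess
    if ngram_detector is not None:
        return ngram_detector(" ".join(tokens))
    return dictionary_vote(tokens)


if __name__ == "__main__":
    print(clean_tokens("Download the software at http://example.com 123 now"))
    print(detect("how are you"))
```

In this sketch the dictionary is consulted first only when the text looks hard for an n-gram model; otherwise the cleaned text is handed to the n-gram detector, with the dictionary as a last resort when no detector is supplied.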

What dataset is most suitable for this task? And how can I improve this approach?

So far I'm using EUROPARL and Wikipedia articles. I'm using NLTK for most of the work.
