Detect text language in R


Question

I have a list of tweets and I would like to keep only those that are in English.

How can I do this?

Recommended answer

The textcat package does this. It can detect 74 "languages" (more properly, language/encoding combinations), and more with extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

Here is the abstract:

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
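
To make the "character n-gram frequencies" idea in the abstract more concrete, here is a minimal base-R sketch of the general Cavnar and Trenkle approach: rank the most frequent character n-grams of a text, then compare two rankings with an "out-of-place" distance. This is not textcat's implementation. The function names ngram_profile and out_of_place are hypothetical, the training strings are toy data, and the sketch skips the word tokenization and padding that the original method uses.

```r
# Rank the character n-grams (n = 1..max_n) of a single string by frequency.
ngram_profile <- function(text, max_n = 3, top = 300) {
  text <- tolower(text)
  grams <- character(0)
  for (n in seq_len(max_n)) {
    if (nchar(text) < n) break
    starts <- seq_len(nchar(text) - n + 1)
    grams <- c(grams, substring(text, starts, starts + n - 1))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  # Keep only the names of the `top` most frequent n-grams, in rank order.
  names(counts)[seq_len(min(top, length(counts)))]
}

# Out-of-place distance: sum of rank differences between two profiles,
# with a fixed penalty for n-grams missing from the reference profile.
out_of_place <- function(doc, ref) {
  pos <- match(doc, ref)
  penalty <- length(ref)
  sum(ifelse(is.na(pos), penalty, abs(pos - seq_along(doc))))
}

# Toy reference profiles (assumed data, far too small for real use).
en <- ngram_profile("the quick brown fox jumps over the lazy dog")
de <- ngram_profile("der schnelle braune fuchs springt ueber den faulen hund")

# The reference profile with the smaller distance is the guessed language.
doc <- ngram_profile("the dog jumps")
c(english = out_of_place(doc, en), german = out_of_place(doc, de))
```

In practice the reference profiles are built from large corpora, which is what textcat ships with, so the package should be preferred over a hand-rolled version like this.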

Here is one of their examples:

library("textcat")

# Classify each string; textcat() returns one language label per element.
textcat(c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en espa~nol."))
# [1] "english" "german"  "spanish"
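
To apply this to the original question, one possible approach is to classify every tweet and subset on the resulting labels. The sketch below assumes the tweets are held in a character vector; the variable names tweets, langs and english_tweets are placeholders, and the sample strings are simply reused from the example above.

```r
library("textcat")

# Placeholder data; replace with your own character vector of tweets.
tweets <- c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en espa~nol.")

# One language label per tweet (a label may be NA if no language is assigned).
langs <- textcat(tweets)

# Keep only tweets labelled "english"; %in% treats NA as FALSE,
# so tweets without a label are dropped as well.
english_tweets <- tweets[langs %in% "english"]
english_tweets
```

Given the labels shown in the example above, this would keep only the first sentence. Tweets are short, and n-gram based identification tends to be less reliable on short texts, so it is worth spot-checking the labels on a sample before filtering a full data set.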
