Extract English words from a text in R


Question

I have a text and need to extract all the English words from it. For instance, I would like a function that analyses the vector

vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")

and returns only the English words from it, i.e. "picture", "carpet", "lamp".

I understand that the definition of an "English word" depends on the dictionary, but I would be satisfied even with a basic one.

Recommended answer

You could use qdapDictionaries, a package I maintain (the parent package qdap does not need to be installed). If your data is more complex, you may need tools such as tolower to make it work. The idea is simply to see where a known word list (see ?GradyAugmented) intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:

vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")

library(qdapDictionaries)
vector[vector %in% GradyAugmented]

## [1] "picture" "carpet"  "lamp"

intersect(vector, GradyAugmented)

## [1] "picture" "carpet"  "lamp"   
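
For raw text rather than a clean vector of words, the tolower remark above could be applied roughly as follows. This is only a sketch: extract_english_words and toy_dictionary are hypothetical names made up for illustration, and the three-word list is a stand-in so that the example runs without any package installed. With qdapDictionaries loaded, GradyAugmented could be passed as the dictionary instead.

```r
# Hypothetical helper (not part of qdapDictionaries): tokenise raw text,
# normalise case, and keep the tokens found in a word list.
extract_english_words <- function(text, dictionary) {
  # Split on runs of characters that are neither letters nor apostrophes
  tokens <- unlist(strsplit(text, "[^[:alpha:]']+"))
  # Drop empty strings left by leading punctuation or whitespace
  tokens <- tokens[nzchar(tokens)]
  # Lower-case so that "Picture" matches "picture" in the word list
  tokens <- tolower(tokens)
  # Keep dictionary words, preserving order and repeats
  tokens[tokens %in% dictionary]
}

# Stand-in word list (an assumption for this example);
# in practice use GradyAugmented from qdapDictionaries
toy_dictionary <- c("picture", "carpet", "lamp")

text <- "Picture, CARPET; lamp notaword anothernotaword!"
extract_english_words(text, toy_dictionary)
```

The sketch uses %in% rather than intersect because intersect() returns each match only once, whereas %in% keeps repeated words in their original positions.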

The error you are getting when installing qdap suggests that @Ben Bolker is correct. You will need a newer version of data.table installed (I'd suggest the latest); use packageVersion("data.table") to check which one you have. Not requiring a minimum version of data.table was an oversight on my part: I thought setDT (a function in the data.table package) had always been around, but it appears not to be in your version. To solve this particular problem, though, you don't need the parent qdap package, just qdapDictionaries.
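
One way to check whether the installed data.table provides setDT is sketched below. The variable name has_setDT is made up for this example, and the check only looks for the exported function; it does not test any particular version threshold.

```r
# Hypothetical check: is data.table installed, and does it export setDT?
has_setDT <- requireNamespace("data.table", quietly = TRUE) &&
  "setDT" %in% getNamespaceExports("data.table")

if (has_setDT) {
  message("data.table ", as.character(packageVersion("data.table")),
          " provides setDT")
} else {
  message("setDT not found; install.packages(\"data.table\") ",
          "should fetch a current version")
}
```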
