Remove meaningless words from corpus in R

Question

I am using the tm and wordcloud packages for some basic text mining in R. The text being processed contains many meaningless words such as asfdg and aawptkr, and I need to filter them out. The closest solution I have found is to use library(qdapDictionaries) and build a custom function that checks whether a word is valid.

library(qdapDictionaries)
is.word  <- function(x) x %in% GradyAugmented

# example
> is.word("aapg")
[1] FALSE

The rest of the text-mining code is:

library(tm)

curDir <- "E:/folder1/"  # folder1 contains a.txt, b.txt
myCorpus <- VCorpus(DirSource(curDir))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

# foo is a placeholder for the function that should remove meaningless words
myCorpus <- tm_map(myCorpus, foo)

The problem is that is.word() works fine on data frames, but how can I use it on a corpus?

Thanks

Recommended answer

I'm not sure this is the most resource-efficient method (I don't know the package very well), but it should work:

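library(tm)
library(qdapDictionaries)  # provides GradyAugmented

tdm <- TermDocumentMatrix(myCorpus)
all_tokens       <- findFreqTerms(tdm, 1)                # every term that occurs at least once
tokens_to_remove <- setdiff(all_tokens, GradyAugmented)  # terms not found in the dictionary
myCorpus <- tm_map(myCorpus, content_transformer(removeWords),
                   tokens_to_remove)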
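
A possible alternative, sketched here under stated assumptions rather than taken from the answer, is to apply the dictionary check to each document directly: split the text into tokens, keep only those found in the dictionary, and wrap that function in content_transformer() so tm_map() can apply it. The helper keep_known_words and the toy_dictionary vector are hypothetical names introduced for illustration; in the question's setup the dictionary argument would be GradyAugmented. The sketch assumes punctuation has already been removed, as in the question's pipeline, and it compares tokens in lower case while leaving the original casing in the output.

```r
# Hypothetical helper (not part of tm or qdapDictionaries):
# keep only the whitespace-separated tokens of each line that appear in `dictionary`.
keep_known_words <- function(text, dictionary) {
  vapply(text, function(line) {
    tokens <- strsplit(line, "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]          # drop empty strings from leading spaces
    paste(tokens[tolower(tokens) %in% dictionary], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Tiny stand-in dictionary so the sketch runs without qdapDictionaries;
# with that package loaded, GradyAugmented would be passed instead.
toy_dictionary <- c("text", "mining", "with", "some", "noise")

cleaned <- keep_known_words("Text mining asfdg with aawptkr some noise",
                            toy_dictionary)
print(cleaned)  # "Text mining with some noise"

# If tm is installed, the same helper can be applied to a corpus.
# Extra arguments given to tm_map() are passed on to the wrapped function.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  toy_corpus <- VCorpus(VectorSource(c("text asfdg mining",
                                       "aawptkr some noise")))
  toy_corpus <- tm_map(toy_corpus, content_transformer(keep_known_words),
                       dictionary = toy_dictionary)
  print(as.character(toy_corpus[[1]]))
}
```

Applied to the question's corpus, the call would look like tm_map(myCorpus, content_transformer(keep_known_words), dictionary = GradyAugmented). Unlike the approach above, this does not need a term-document matrix first, though whether it is faster on a given corpus is untested here.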
