R tm removeWords function not removing words

Question

I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a data frame that lists my words in order of frequency. I use this list to identify words I am not interested in, and then try to create a new list with those words removed. However, the words remain in my dataset. What am I doing wrong, and why aren't the words being removed? I have included the full code below:

install.packages("rvest")
install.packages("tm")
install.packages("SnowballC")
install.packages("stringr")
library(stringr) 
library(tm) 
library(SnowballC) 
library(rvest)

# Pull in the data I have been using. 
paperList <- html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")
paperURLs <- paperList %>%
  html_nodes(xpath="//*[@class='search-results-title']/a") %>%
  html_attr("href")
paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")
paper_html <- sapply(1:length(paperURLs), function(x) html(paperURLs[x]))

paperText <- sapply(1:length(paper_html), function(x) paper_html[[1]] %>%
                      html_nodes(xpath="//*[@class='article-content']") %>%
                      html_text() %>%
                      str_trim(.))
# Create corpus
paperCorp <- Corpus(VectorSource(paperText))
for(j in seq(paperCorp))
{
  paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
}

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)

paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))

paperCorp <- tm_map(paperCorp, stemDocument)

paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

dtm <- DocumentTermMatrix(paperCorpPTD)

termFreq <- colSums(as.matrix(dtm))
head(termFreq)

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)

# After having identified words I am not interested in
# create new corpus with these words removed.
paperCorp1 <- tm_map(paperCorp, removeWords, c("also", "article", "Article", 
                                              "download", "google", "figure",
                                              "fig", "groups","Google", "however",
                                              "high", "human", "levels",
                                              "larger", "may", "number",
                                              "shown", "study", "studies", "this",
                                              "using", "two", "the", "Scholar",
                                              "pubmedncbi", "PubMedNCBI",
                                              "view", "View", "the", "biol",
                                              "via", "image", "doi", "one", 
                                              "analysis"))

paperCorp1 <- tm_map(paperCorp1, stripWhitespace)
paperCorpPTD1 <- tm_map(paperCorp1, PlainTextDocument)
dtm1 <- DocumentTermMatrix(paperCorpPTD1)
termFreq1 <- colSums(as.matrix(dtm1))
tf1 <- data.frame(term = names(termFreq1), freq = termFreq1)
tf1 <- tf1[order(-tf1[,2]),]
head(tf1, 100)

If you look through tf1, you will notice that plenty of the words that were specified to be removed have not actually been removed.

Just wondering what I am doing wrong, and how I might remove these words from my data?

NOTE: removeWords is doing something, because the output of head(tf, 100) and head(tf1, 100) is not exactly the same. So removeWords seems to be removing some instances of the words I am trying to get rid of, but not all instances.

Recommended answer

I switched some code around and added tolower. The stopwords are all lowercase, so you need to lowercase the text first, before you remove stopwords.

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
# added tolower
paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
# moved stripWhitespace
paperCorp <- tm_map(paperCorp, stripWhitespace)

paperCorp <- tm_map(paperCorp, stemDocument)
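
To see the effect of lowercasing in isolation, here is a minimal sketch that calls removeWords directly on a character string instead of a corpus. The sentence and the names txt, unwanted, keptCase and lowered are made up for illustration; only removeWords and tolower come from the code above.

```r
library(tm)

# Made-up sentence standing in for one document of paperCorp.
txt <- "The Study shows that Google Scholar helped this study"
unwanted <- c("study", "the", "scholar")

# removeWords matches case-sensitively, so capitalized forms survive.
keptCase <- removeWords(txt, unwanted)

# Lowercasing first lets every case variant match the lowercase list.
lowered <- removeWords(tolower(txt), unwanted)
# Collapse the gaps left behind by the removed words.
lowered <- gsub("\\s+", " ", trimws(lowered))

print(keptCase)
print(lowered)
```

On a corpus, recent tm versions usually expect the lowercasing step to be wrapped as tm_map(corp, content_transformer(tolower)); whether that is needed depends on the installed tm version and the corpus class.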

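Uppercase words are no longer needed in the list, since we set everything to lowercase. You can remove these. One caveat: "Scholar" appears in the list only in its capitalized form, so replace it with the lowercase "scholar" rather than just deleting it; otherwise the word stays in the corpus, as the frequency table further down shows.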
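
Rather than editing the list by hand, one option is to normalize it in code. This is only a sketch on a shortened copy of the custom list; the variable name customWords is an assumption and does not appear in the original code.

```r
# Shortened copy of the custom list, keeping its mixed-case entries
# and the duplicated "the".
customWords <- c("also", "article", "Article", "google", "Google",
                 "the", "Scholar", "pubmedncbi", "PubMedNCBI",
                 "view", "View", "the")

# Lowercase everything, then drop the duplicates this creates.
customWords <- unique(tolower(customWords))
print(customWords)
```

The normalized vector can then be passed to removeWords in place of the literal c(...) list.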

paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "Article", 
                                               "download", "google", "figure",
                                               "fig", "groups","Google", "however",
                                               "high", "human", "levels",
                                               "larger", "may", "number",
                                               "shown", "study", "studies", "this",
                                               "using", "two", "the", "Scholar",
                                               "pubmedncbi", "PubMedNCBI",
                                               "view", "View", "the", "biol",
                                               "via", "image", "doi", "one", 
                                               "analysis"))

paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

dtm <- DocumentTermMatrix(paperCorpPTD)

termFreq <- colSums(as.matrix(dtm))
head(termFreq)

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)

           term  freq
fatty     fatty 29568
pparα     ppara 23232
acids     acids 22848
gene       gene 15360
dietary dietary 12864
scholar scholar 11904

tf[tf$term == "study", ]

[1] term freq
<0 rows> (or 0-length row.names)

As you can see, study is no longer in the corpus. The rest of the words are gone as well.
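
Checking one word at a time is tedious, so here is a hedged sketch of checking the whole list at once. The data frame tfToy is made up and only mimics the shape of tf; customWords and leftover are likewise hypothetical names.

```r
# Hypothetical frequency table shaped like tf (the values are made up).
tfToy <- data.frame(term = c("fatty", "gene", "scholar"),
                    freq = c(30, 15, 12),
                    stringsAsFactors = FALSE)
customWords <- c("also", "study", "Scholar")

# Rows whose term is still one of the words meant for removal,
# compared in lowercase because the corpus was lowercased.
leftover <- tfToy[tfToy$term %in% tolower(customWords), ]
print(leftover)
```

An empty result would mean every listed word is gone; any rows that remain point to entries whose form in the list does not match the form in the corpus.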
