How to remove empty documents from a document-term matrix in R


Question

I am running k-means clustering on Twitter data. To prepare for it, I clean the tweets, build a corpus, create a document-term matrix (dtm), and then apply tf-idf weighting.

However, my dtm contains a few empty documents, which I want to remove because kmeans cannot run on empty documents.

Here is my code:

library(tm)        # Corpus, tm_map, DocumentTermMatrix, weightTfIdf
library(magrittr)  # %>% pipe

removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x <- gsub("[@]"," ",x)
  x <- gsub("[???]"," ",x)
  x <- gsub("[€]"," ",x)
  x

}

myStopwords <- c(stopwords('english'), "rt")


#preprocessing
tweet_corpus <- Corpus(VectorSource(tweet_raw$text))
tweet_corpus_clean <- tweet_corpus %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeNumbers) %>%
  tm_map(removeWords,myStopwords) %>%
  tm_map(content_transformer(replacePunctuation)) %>%
  tm_map(stripWhitespace)%>%
  tm_map(content_transformer(removeURL))


dtm <- DocumentTermMatrix(tweet_corpus_clean ) 

#tf-idf

mat4 <- weightTfIdf(dtm) #when i run this, i get 2 docs that are empty
mat4 <- as.matrix(mat4)  

Recommended answer

If some of your documents contain no terms at all, you can remove them like this:

rowSumDoc <- apply(dtm, 1, sum) 
dtm2 <- dtm[rowSumDoc > 0, ] 

The first line sums the term counts in each document, that is, in each row of the dtm. The second line keeps only the rows whose sum is greater than zero, which are the non-empty documents. You would then pass dtm2, rather than dtm, to weightTfIdf.
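One caveat: apply() converts the sparse dtm to a dense matrix before summing, which can use a lot of memory on a large tweet collection. Below is a minimal sketch of an alternative, assuming the slam package (which tm itself depends on) is available. The toy corpus and the names docs, keep, dtm_nonempty, and tweets_kept are hypothetical and only for illustration.

```r
library(tm)    # VCorpus, DocumentTermMatrix, weightTfIdf
library(slam)  # row_sums() works directly on the sparse matrix

# Hypothetical toy data: the second "tweet" is empty after cleaning.
docs <- c("kmeans clusters tweets", "", "tweets about clustering")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Total term count per document, computed without densifying the dtm.
keep <- row_sums(dtm) > 0

# Drop the empty rows, and filter the source text the same way so that
# cluster labels can later be matched back to the original tweets.
dtm_nonempty <- dtm[keep, ]
tweets_kept <- docs[keep]

# Apply tf-idf weighting only after the empty documents are gone.
tfidf <- as.matrix(weightTfIdf(dtm_nonempty))
```

Keeping the logical vector keep around is a design choice: it lets you apply the same filter to any other per-tweet data (for example tweet_raw in the question) so that rows stay aligned with the matrix passed to kmeans.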
