More efficient means of creating a corpus and DTM with 4M rows
Question
My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)

GetCorpus <- function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}
data <- data.frame(
  c("Let the big dogs hunt", "No holds barred", "My child is an honor student"),
  stringsAsFactors = F)

corp <- GetCorpus(data[, 1])
inspect(corp)

dtm <- DocumentTermMatrix(corp)
inspect(dtm)
The output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
Terms
Docs bar big child dogs holds honor hunt let stud
character(0) 0 1 0 1 0 0 1 1 0
character(0) 1 0 0 0 1 0 0 0 0
character(0) 0 0 1 0 0 1 0 0 1
My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use more than 300k rows.
I have heard that I could use data.table, but I am not sure how.
I have also looked at the qdap package, but it gives me an error when trying to load it, and I don't even know whether it would work.
Ref: http://cran.r-project.org/web/packages/qdap/qdap.pdf
Answer
I think you may want to consider a more regex-focused solution. These are some of the problems/ideas I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has some consistently named functions that are wicked fast for string manipulation.
In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix and manipulating it with specific packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
  text = c("Let the big dogs hunt",
           "No holds barred",
           "My child is an honor student"),
  stringsAsFactors = F)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop = FALSE]

library(stringi)
library(SnowballC)

# in old package versions this was named 'stri_extract_words'
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))
names(out) <- paste0("doc", 1:length(out))
lev <- sort(unique(unlist(out)))

# one column per document: tabulate each document's tokens over the shared vocabulary
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting = weightTf)
dtm
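As a side note, the same lowercase/stopword/stem pipeline can be sketched with the quanteda package (my suggestion, not something used in this answer), which builds a sparse document-feature matrix directly from tokens:

```r
library(quanteda)

# Tokenize, normalize, drop stopwords, then stem -- mirroring the steps above
toks <- tokens(c("Let the big dogs hunt", "No holds barred",
                 "My child is an honor student"),
               remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(tokens_tolower(toks), stopwords("english"))

# dfm() returns a sparse document-feature matrix, analogous to tm's DTM
dfm1 <- dfm(tokens_wordstem(toks, language = "english"))
```

Because the result is sparse from the start, it sidesteps the dense intermediate matrix the answer above builds with `do.call(cbind, ...)`.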