More efficient means of creating a corpus and DTM with 4M rows

Problem description

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a bayesian classifier.

Consider the following code:

library(tm)

GetCorpus <-function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}

data <- data.frame(
  c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)

corp <- GetCorpus(data[,1])

inspect(corp)

dtm <- DocumentTermMatrix(corp)

inspect(dtm)

Output:

> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt

[[2]]
<<PlainTextDocument (metadata: 7)>>
 holds bar

[[3]]
<<PlainTextDocument (metadata: 7)>>
 child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1

My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.

I have heard that I could use data.table but I am not sure how.
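Presumably the idea would be to tokenize into long format and then count by document and term; a minimal sketch of that pattern (illustrative column names, simple whitespace tokenization) might look like:

library(data.table)
library(stringi)

dt <- data.table(doc = 1:3,
                 txt = c("Let the big dogs hunt",
                         "No holds barred",
                         "My child is an honor student"))

## one row per (doc, token)
tokens <- dt[, .(term = unlist(stri_extract_all_words(stri_trans_tolower(txt)))), by = doc]

## long-format term frequencies, then cast to a wide document-term table
tf <- tokens[, .N, by = .(doc, term)]
dcast(tf, doc ~ term, value.var = "N", fill = 0)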

I have also looked at the qdap package, but it gives me an error when trying to load the package, plus I don't even know if it will work.

Reference: http://cran.r-project.org/web/packages/qdap/qdap.pdf

Recommended answer

I think you may want to consider a more regex-focused solution. These are some of the problems/ideas I'm wrestling with as a developer. I'm currently looking hard at the stringi package for development, as it has consistently named functions that are wicked fast for string manipulation.

In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keep the data in a matrix, and manipulate it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.

data <- data.frame(
    text=c("Let the big dogs hunt",
        "No holds barred",
        "My child is an honor student"
    ), stringsAsFactors = F)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]

library(stringi)
library(SnowballC)
out <- stri_extract_all_words(stri_trans_tolower(data[[1]])) # tokenize first (in old stringi versions this was named 'stri_extract_words')
out <- lapply(out, SnowballC::wordStem, language = "english") # then stem each token, not the whole sentence
names(out) <- paste0("doc", seq_along(out))

lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ] 

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm

## or... (transpose first, since dat2 has terms as rows and documents as columns)
dtm <- tm::as.DocumentTermMatrix(t(dat2), weighting=weightTf)
dtm
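
As a quick sanity check on the result (using the objects created above), the matrix dimensions can be verified and, if desired, very rare terms dropped before handing the DTM to a classifier:

dim(dtm)   # documents x terms

## optionally trim very rare terms to shrink the matrix before classification
dtm_small <- tm::removeSparseTerms(dtm, 0.999)
dtm_small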
