Use tm's Corpus function with big data in R


Question

I'm trying to do text mining on big data in R with tm.

I run into memory issues frequently (such as cannot allocate vector of size...) and use the established methods of troubleshooting those issues, such as

  • using 64-bit R
  • trying different OS's (Windows, Linux, Solaris, etc)
  • setting memory.limit() to its maximum
  • making sure that sufficient RAM and compute is available on the server (which there is)
  • making liberal use of gc()
  • profiling the code for bottlenecks
  • breaking up big operations into multiple smaller operations
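For reference, the memory.limit() and gc() steps above look roughly like this. Note that memory.limit() is Windows-only (it has no effect on Linux or Solaris), and the size shown here is just an illustrative value, not a recommendation:

```r
# Windows-only: raise R's memory ceiling; size is given in MB.
# On 64-bit Windows this can be set up to the machine's physical RAM.
memory.limit(size = 64 * 1024)   # e.g. allow up to 64 GB; adjust to your machine

# Trigger a garbage collection and print a summary of memory currently in use
gc(verbose = TRUE)
```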

However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual and I'm not sure how to work around the problem. The error is:

> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)

Can (and should) I run Corpus incrementally on blocks of rows from that source dataframe then combine the results? Is there a more efficient way to run this?

The size of the data that produces this error depends on the machine running it, but if you take the built-in crude dataset and replicate its documents until the corpus is large enough, you can reproduce the error.
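As a rough sketch of that reproduction (the replication factor of 50,000 is an arbitrary assumption, and this uses the tm 0.6-era DataframeSource that accepts a plain one-column data frame; newer tm versions require doc_id/text columns):

```r
library(tm)
data("crude")   # 20 sample Reuters articles shipped with tm

# Flatten the sample corpus to plain character strings
txt <- unlist(lapply(crude, as.character))

# Replicate the documents until the vector is large enough to exhaust memory;
# the factor needed depends on the machine
dfs <- data.frame(text = rep(txt, 50000), stringsAsFactors = FALSE)

ds <- Corpus(DataframeSource(dfs))   # should eventually fail with
                                     # "Error: memory exhausted (limit reached?)"
```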

Update

I've been experimenting with combining smaller corpora, i.e.

test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package? I'm investigating...

> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"

Answer

I don't know if tm_combine became deprecated or why it's not found in the tm namespace, but I did find a solution by running Corpus on smaller chunks of the dataframe and then combining them.

This StackOverflow post had a simple way to do that without tm_combine:

test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)

which gives you:

ds.12

<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
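Generalising the two-chunk example above, here is an untested sketch that processes the whole data frame in fixed-size blocks and concatenates the pieces with c() (the chunk size of 100,000 is an assumption; tune it to what fits in memory):

```r
chunk_size <- 100000
n <- nrow(dfs)

# Row indices for each fixed-size block
blocks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))

# Build one small corpus per block, then concatenate them all;
# c() on corpora is the operation tm_combine used to wrap
pieces <- lapply(blocks, function(i) Corpus(DataframeSource(dfs[i, , drop = FALSE])))
ds.all <- do.call(c, pieces)
```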

Sorry for not figuring this out on my own before asking. I tried and failed with other ways of combining the objects.
