Use tm's Corpus function with big data in R
Question
I'm trying to do text mining on big data in R with tm.
I frequently run into memory issues (such as "cannot allocate vector of size...") and use the established methods of troubleshooting them, such as:
- using 64-bit R
- trying different OS's (Windows, Linux, Solaris, etc.)
- setting memory.limit() to its maximum
- making sure that sufficient RAM and compute are available on the server (which there is)
- making liberal use of gc()
- profiling the code for bottlenecks
- breaking up big operations into multiple smaller operations
However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual, and I'm not sure how to work around the problem. The error is:
> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)
Can (and should) I run Corpus incrementally on blocks of rows from the source dataframe and then combine the results? Is there a more efficient way to do this?
The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, you can reproduce the error.
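A concrete sketch of that reproduction (assuming the tm package is installed; the replication factor `times` needed to trigger the error depends on your machine's RAM, so scale it up well beyond the small value shown here):

```r
library(tm)

data("crude")  # 20 short Reuters news documents shipped with tm

# Flatten each document to a single string.
texts <- vapply(crude, function(d) paste(as.character(d), collapse = " "),
                character(1))

# Replicate the documents; keep increasing `times` until Corpus() runs
# out of memory on your machine.
big <- rep(texts, times = 50)
dfs <- data.frame(doc_id = seq_along(big), text = big,
                  stringsAsFactors = FALSE)
nrow(dfs)  # 1000 rows with times = 50
```

The `doc_id`/`text` column names are what recent versions of tm's DataframeSource expect; older versions simply read every column as text.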
Update
I've been experimenting with combining smaller corpora, i.e.
test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package? I'm investigating...
> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"
Answer
I don't know whether tm_combine was deprecated or why it's not found in the tm namespace, but I did find a solution: use Corpus on smaller chunks of the dataframe, then combine them.
This StackOverflow post had a simple way to do that without tm_combine:
test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)
which gives you:
> ds.12
<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
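Hardcoding two chunks doesn't scale well to a million rows. A generic version of the same idea (a sketch, using a small toy dataframe in place of the real one; `chunk_size` is an arbitrary choice you should tune to your RAM):

```r
library(tm)

# Toy stand-in for the real million-row dataframe.
dfs <- data.frame(doc_id = 1:100,
                  text = rep(c("first document", "second document"), 50),
                  stringsAsFactors = FALSE)

chunk_size <- 25  # pick the largest size your RAM tolerates
starts <- seq(1, nrow(dfs), by = chunk_size)

# Build one corpus per chunk of rows, then concatenate them all with c().
corpora <- lapply(starts, function(i) {
  rows <- i:min(i + chunk_size - 1, nrow(dfs))
  Corpus(DataframeSource(dfs[rows, , drop = FALSE]))
})
ds.all <- do.call(c, corpora)
length(ds.all)  # 100 documents
```

This is just the two-chunk trick above wrapped in a loop; `do.call(c, corpora)` applies c() across the whole list of chunk corpora at once.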
Sorry not to have figured this out on my own before asking. I tried and failed with other ways of combining objects.