Text Mining in R | memory management
Problem description
I am doing data mining on a 160 MB text file, but once I convert it to a matrix to get the word frequencies, it demands too much memory. Can someone please help me with this?
> dtm <- DocumentTermMatrix(clean)
> dtm
<<DocumentTermMatrix (documents: 472029, terms: 171548)>>
Non-/sparse entries: 3346670/80972284222
Sparsity : 100%
Maximal term length: 126
Weighting : term frequency (tf)
> as.matrix(dtm)
Error: cannot allocate vector of size 603.3 Gb
@Vineet here is the math that shows why R tried to allocate 603 GB to convert the document term matrix to a non-sparse matrix. Each numeric cell in an R matrix consumes 8 bytes. Based on the size of the document term matrix in the question, the math looks like this:
> #
> # calculate memory consumed by matrix
> #
>
> rows <- 472029
> cols <- 171548
> # memory in gigabytes
> rows * cols * 8 / (1024 * 1024 * 1024)
[1] 603.3155
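As an alternative sketch (not part of the original answer): tm stores a DocumentTermMatrix as a sparse simple triplet matrix from the slam package, so the per-term totals can be computed directly on the sparse representation, without ever allocating the dense 603 GB matrix. Assuming `dtm` is the DocumentTermMatrix from the question:

```r
# Compute term frequencies from the sparse DocumentTermMatrix directly.
# slam::col_sums() operates on the simple_triplet_matrix representation,
# so memory use stays proportional to the non-zero entries only.
library(slam)

term_freq <- col_sums(dtm)                       # one total per term
term_freq <- sort(term_freq, decreasing = TRUE)  # most frequent first
head(term_freq, 20)                              # top 20 terms
```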
If you want to calculate the word frequencies, you're better off generating 1-grams and then summarizing them into a frequency distribution.
With the quanteda
package the code would look like this.
words <- tokenize(...)
ngram1 <- unlist(tokens_ngrams(words, n = 1))
ngram1freq <- data.frame(table(ngram1))
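A small follow-up sketch (my addition, not from the original answer): `table()` returns counts in no particular order, so to read off the most common words you would typically sort the resulting data frame by its `Freq` column:

```r
# Order the frequency table so the most common 1-grams come first.
# Column names (ngram1, Freq) are those produced by data.frame(table(...)).
ngram1freq <- ngram1freq[order(-ngram1freq$Freq), ]
head(ngram1freq, 20)  # top 20 words
```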
regards,
Len
2017-11-24 UPDATE: Here is a complete example from the quanteda package that generates the frequency distribution from a document feature matrix using the textstat_frequency()
function, as well as a barplot()
for the top 20 features.
This approach does not require the generation & aggregation of n-grams into a frequency distribution.
library(quanteda)
myCorpus <- corpus(data_char_ukimmig2010)
system.time(theDFM <- dfm(myCorpus,tolower=TRUE,
remove=c(stopwords(),",",".","-","\"","'","(",")",";",":")))
system.time(textFreq <- textstat_frequency(theDFM))
hist(textFreq$frequency,
main="Frequency Distribution of Words: UK 2010 Election Manifestos")
top20 <- textFreq[1:20,]
barplot(height=top20$frequency,
names.arg=top20$feature,
horiz=FALSE,
las=2,
main="Top 20 Words: UK 2010 Election Manifestos")
...and the resulting barplot: