Remove stopwords and tolower function slow on a Corpus in R

Views: 64 Published: 2021/6/15 19:36:41 Tags: r performance text-mining tm

Question

I have a corpus of roughly 75 MB of data. I am trying to run the following commands:

tm_map(doc.corpus, removeWords, stopwords("english"))
tm_map(doc.corpus, tolower)

These two functions alone take at least 40 minutes to run. I am looking to speed up the process, since I use the term-document matrix (TDM) for my model.

I have frequently tried commands like gc() and memory.limit(10000000), but I have not been able to speed up the processing.

I have a system with 4 GB of RAM, running a local database to read the input data.

Any suggestions for speeding this up would be appreciated!

Recommended answer

Maybe you can give quanteda a try:

library(stringi)
library(tm)
library(quanteda)

txt <- stri_rand_lipsum(100000L)
print(object.size(txt), units = "Mb")
# 63.4 Mb

system.time(
  dfm <- dfm(txt, toLower = TRUE, ignoredFeatures = stopwords("en")) 
)
# Elapsed time: 12.3 seconds.
#        user      system     elapsed 
#       11.61        0.36       12.30 

system.time(
  dtm <- DocumentTermMatrix(
    Corpus(VectorSource(txt)), 
    control = list(tolower = TRUE, stopwords = stopwords("en"))
  )
)
#  user      system     elapsed 
# 157.16        0.38      158.69
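
The `dfm()` call above uses argument names (`toLower`, `ignoredFeatures`) that appear to come from an older quanteda release. As a minimal sketch, assuming quanteda version 3 or later (where `dfm()` expects a tokens object rather than a character vector), the same lowercasing and stopword removal might look like the following. The sample texts and the variable names `toks` and `dfmat` are illustrative and not part of the original answer, and no timings are claimed for this variant.

```r
library(quanteda)

# Small illustrative input; replace with your own character vector of documents.
txt <- c(
  doc1 = "The quick brown Fox jumps over the lazy dog.",
  doc2 = "A corpus is a collection of the texts."
)

# Tokenize first, then lowercase and drop stopwords on the tokens object.
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords::stopwords("en", source = "snowball"))

# Build the document-feature matrix from the cleaned tokens.
dfmat <- dfm(toks)
print(dfmat)
```

If a tm-style matrix is needed downstream, quanteda's `convert(dfmat, to = "tm")` may be worth checking, though that step is an assumption about the asker's workflow.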
