Remove stopwords and tolower function slow on a Corpus in R
Problem description
I have a corpus of roughly 75 MB of data. I am trying to use the following commands:
tm_map(doc.corpus, removeWords, stopwords("english"))
tm_map(doc.corpus, tolower)
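As an aside, in current versions of tm, a plain base function like tolower is no longer a valid transformation on its own; it must be wrapped in content_transformer() so the corpus structure and metadata are preserved. A minimal sketch (the small example corpus here is made up for illustration):

library(tm)

# Hypothetical toy corpus standing in for the 75 MB data set
doc.corpus <- VCorpus(VectorSource(c("Some Example TEXT", "More WORDS here")))

# Wrap tolower in content_transformer() so tm_map returns a proper corpus;
# removeWords is already a tm transformation and needs no wrapper
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))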
These two functions alone take at least 40 minutes to run. I am looking to speed up the process, since I am building a tdm (term-document matrix) for my model.
I have frequently tried commands like gc() and memory.limit(10000000), but I am not able to speed things up.
I have a system with 4 GB of RAM, and I am running a local database to read the input data.
Any suggestions for speeding this up?
Recommended answer
Maybe you can give quanteda a try:
library(stringi)
library(tm)
library(quanteda)

# Generate ~63 MB of random lorem-ipsum text as a stand-in corpus
txt <- stri_rand_lipsum(100000L)
print(object.size(txt), units = "Mb")
# 63.4 Mb

# quanteda: lowercase and drop stopwords while building the dfm
system.time(
  dfm <- dfm(txt, toLower = TRUE, ignoredFeatures = stopwords("en"))
)
# Elapsed time: 12.3 seconds.
#    User  System Elapsed
#   11.61    0.36   12.30

# tm: the same preprocessing via DocumentTermMatrix, for comparison
system.time(
  dtm <- DocumentTermMatrix(
    Corpus(VectorSource(txt)),
    control = list(tolower = TRUE, stopwords = stopwords("en"))
  )
)
#    User  System Elapsed
#  157.16    0.38  158.69
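Note that the toLower and ignoredFeatures arguments above come from an older quanteda release. In newer versions (assuming quanteda 2.x or later), the equivalent pipeline goes through tokens() and the tokens_* helpers before building the dfm; the tiny example text here is made up for illustration:

library(quanteda)

# Hypothetical two-document corpus
txt <- c("This is Some Example TEXT.", "And a FEW more words.")

# Newer quanteda API: tokenize, lowercase, drop stopwords, then build the dfm
toks <- tokens(txt)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
mat  <- dfm(toks)

The speed advantage over tm comes from the same place in both APIs: quanteda tokenizes once and works on the token objects directly, rather than re-scanning every document for each transformation.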