R: How to prevent memory overflow when using mgsub in vector mode?

Question

I have a long character vector (e.g. "Hello World", etc.) with 1.7M elements, and I need to substitute words in it using a mapping between two vectors, saving the result in the same vector. Here's a simple example:

library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)

Result:

[1] "ONE"  "TWO ONE" "four phONEs"

As you can see, each instance of e[j] in line is replaced with r[j], and only with r[j]. This works fine for a relatively small line and a small e -> r vocabulary, but when I run it with length(line) = 1700000 and length(e) = 750, I hit the memory allocation limit:

Reached total allocation of 7851Mb: see help(memory.size)

Any ideas on how to avoid this?

Recommended answer

I believe you can use gsub with fixed = TRUE.

It sounds like you are concerned with spaces (that is, matching whole, space-delimited words), so just add spaces to the ends of all 3 vectors you're working with. Running the whole sequence from ## Start to ## Finish on 1.7 million strings (roughly the size of your data) gave Time difference of 2.906395 secs. Most of that time is spent at the end, stripping off the extra spaces.

## Recreate data
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

line <- rep(line, 1700000/length(line))

## Start    
line2 <- paste0(" ", line, " ")
e2 <-  paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")


for (i in seq_along(e2)) {
    line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}

# Strip the padding spaces and store the result back in `line`
line <- gsub("^\\s|\\s$", "", line2, perl=TRUE)
## Finish
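
The steps between ## Start and ## Finish can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the function name replace_padded_words is hypothetical, and it swaps the final regex-based trimming for substr(), assuming exactly one padding character was added on each side and that the input contains no NA values.

```r
# Hypothetical helper: replace space-delimited words using fixed (literal) matching.
replace_padded_words <- function(x, from, to) {
  stopifnot(length(from) == length(to))

  # Pad everything with one space on each side so only whole words match.
  padded <- paste0(" ", x, " ")
  from_p <- paste0(" ", from, " ")
  to_p   <- paste0(" ", to, " ")

  # One vectorised gsub() call per vocabulary entry.
  for (i in seq_along(from_p)) {
    padded <- gsub(from_p[i], to_p[i], padded, fixed = TRUE)
  }

  # Drop the single padding character at each end (assumption: no regex needed).
  substr(padded, 2L, nchar(padded) - 1L)
}

line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

result <- replace_padded_words(line, e, r)
print(result)
```

One caveat of the padding approach: matches cannot overlap, so when the same word occurs twice in a row (e.g. "one one"), the shared space is consumed by the first match and the second occurrence is left unchanged.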

qdap's mgsub is not useful here. The package was designed for much smaller data. Additionally, fixed = TRUE is a sensible default because it is so much faster. The point of an add-on package is to improve workflow (sometimes for a specific field or task) by reconfiguring available tools. The mgsub function also has some error handling and other niceties that are useful when analyzing transcripts, but they make the function hog memory. There is often a trade-off between safety plus syntactic sugar on one side and speed on the other.
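
As a small illustration of what fixed = TRUE changes in base gsub (a sketch with made-up input, not taken from the original answer): with the default settings the pattern is interpreted as a regular expression, while fixed = TRUE treats it as a literal string, which is also why special characters need no escaping.

```r
x <- "a.b"

# Default: "." is a regex metacharacter and matches every character.
regex_result <- gsub(".", "-", x)

# fixed = TRUE: "." is matched literally, so only the dot is replaced.
fixed_result <- gsub(".", "-", x, fixed = TRUE)

print(regex_result)  # "---"
print(fixed_result)  # "a-b"
```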

Note that two functions having similar names should not imply anything about their behavior, particularly if they come from add-on packages. Even functions within base R have inconsistently named arguments and differing defaults (look at the apply family of functions; this is less than ideal but is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.
