Join results in more than 2^31 rows (internal vecseq reached physical limit)


Question

I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.)

Here is what my code looks like:

library(parallel)
library(data.table)

data1: several million rows and 3 columns. The columns are tag, prod and v. There are 750K unique values of tag, anywhere from 1 to 1000 prods per tag, and 5000 possible values for prod. v takes any positive real value.

setkey(data1, tag)
merge (data1, data1, allow.cartesian=TRUE)

I get the following error:

Error in vecseq(f_, len_, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq

A new example showcasing by-without-by:

country = fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod = fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")

It seems entirely impossible to select the subset of markets that share a country tag, find the covariances within those pairs, and collate those by country without running up against the size limit. Here is my best shot so far:

setkey(country,country)
setkey(prod, prod, period)
# Really long one-liner: finds the unique market pairs from the self-join, merges them with the second table, and calculates covariances from the merged table.
covars <- setkey(setkey(unique(country[country, allow.cartesian=T][, c("prod","prod.1"), with=F]),prod)[prod, allow.cartesian=T], prod.1, period)[prod, ] [ , list(pcov = cov(value,value.1)), by=list(prod,prod.1)]
clevel <-setkey(country[country, allow.cartesian = T], prod, prod.1)[covars, nomatch=0][ , list(countryvar = sum(share*share.1*pcov)), by="country"]
> clevel
   country countryvar
1:       1   2.858667
2:       2   1.869667

When I try this approach on data of any reasonable size, I run up against the vecseq error. It would be really nice if data.table did not balk so much at the 2^31 limit. I am a fan of the package. Suggestions on how I can use more of the j specification would also be appreciated. (I am not sure how else to try the j specification, given that I have to compute variances from the intersection of the two data tables.)

Recommended answer

R 3.0.1 supports objects having lengths greater than 2^31 - 1. While the packages that come with base R can already create such objects, whether contributed packages can do the same depends on the package. Basically, any package using compiled code would have to be recompiled and possibly modified to take advantage of this feature.
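Before running a cartesian self-join, it can help to estimate how many rows it would produce. A tag with n rows contributes n * n rows to a self-join on tag, so the total is the sum of the squared group sizes. The sketch below uses a small made-up stand-in for data1 (the real table is not shown in the question), so the numbers are illustrative only:

```r
library(data.table)

# Hypothetical stand-in for data1; the real table has several million rows.
set.seed(1)
data1 <- data.table(
  tag  = sample(1:100, 5000, replace = TRUE),
  prod = sample(1:5000, 5000, replace = TRUE),
  v    = runif(5000)
)

# Rows per tag; a self-join on tag yields N * N rows for each tag.
group_sizes <- data1[, .N, by = tag]

# as.numeric() avoids integer overflow when the squares are summed.
expected_rows <- group_sizes[, sum(as.numeric(N)^2)]

# TRUE means the join result would not fit under the 2^31 - 1 row limit.
too_big <- expected_rows > .Machine$integer.max
```

With 750K tags and up to 1000 prods per tag, the upper bound is 750,000 * 1000^2 = 7.5e11 rows, far beyond 2^31, so a skewed distribution of group sizes can hit the limit even when the table itself has only a few million rows.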

Also, assuming that 64 GB of RAM is enough to work with 60 GB objects is rather optimistic.
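One way to sidestep both limits is to avoid materializing the pair table at all, in the spirit of the error message's hint to let j run once per group. The quantity sum(share * share.1 * pcov) over all product pairs within a country equals the quadratic form s' Σ s, where s is that country's share vector and Σ is the covariance matrix of its products. The following is a sketch, not the asker's method, and it rests on some assumptions: a data.table version recent enough to provide dcast() for data.tables (newer than the 1.8.8 mentioned in the question), every product observed in the same periods, and the example tables rebuilt by hand.

```r
library(data.table)

# Example tables from the question, rebuilt by hand.
country <- data.table(
  country = c(1, 1, 1, 2, 2, 2, 2),
  product = c(5, 6, 7, 6, 7, 8, 9),
  share   = c(.2, .2, .6, .3, .1, .4, .2)
)
prod <- data.table(
  prod   = rep(5:9, each = 6),
  period = rep(1990:1995, times = 5),
  value  = c(2, 3, 2, 4, 3, 5,  1, 1, 0, 4, 8, 2,  3, 3, 3, 4, 7, 1,
             2, 4, 2, 4, 2, 6,  1, 2, 4, 4, 5, 6)
)

# One row per period, one column per product (columns named "5", "6", ...).
wide <- dcast(prod, period ~ prod, value.var = "value")
m <- as.matrix(wide[, setdiff(names(wide), "period"), with = FALSE])

# j runs once per country: build only that country's covariance matrix
# and reduce it to a single number with the quadratic form s' Sigma s.
clevel <- country[, {
  cols  <- as.character(product)
  sigma <- cov(m[, cols, drop = FALSE])
  list(countryvar = drop(share %*% sigma %*% share))
}, by = country]
```

On the example data this reproduces the countryvar values shown in the question (2.858667 and 1.869667). The largest intermediate object is one covariance matrix per group, which for at most 1000 products per tag is a 1000 x 1000 matrix rather than a multi-billion-row join.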
