合并 data.tables 使用超过 10 GB RAM [英] Merging data.tables uses more than 10 GB RAM
问题描述
我有两个 data.tables:DT
和 meta
.当我使用 DT[meta]
合并它们时,内存使用量增加了 10 GB 以上(并且合并非常慢).怎么了?好像合并成功了,但是只能看单行,不然内存不够.DT
本身是通过合并两个data.tables创建的,没有任何问题.
I have two data.tables: DT
and meta
. When I merge them using DT[meta]
, memory usage increases by more than 10 GB (and the merge is very slow). What's going wrong? It seems like the merge is successful, but I can only look at single lines, otherwise I run out of memory. DT
itself was created by merging two data.tables without any problems.
钥匙好像有问题.我可以毫无问题地执行以下操作:
It seems to be a problem with the key. I can do the following without a problem:
DT[,id:=1:nrow(DT)]
meta[,id:=1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)
DT2<-DT[meta] # Comment from Matthew Dowle:
# X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it?
unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
Moor_ID
[1,] TRUE
第一个数据表:
str(DT)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 55 variables:
$ DWD_ID : chr "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
$ numdays : int 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
$ days : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
$ TBoden_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ TBoden_dayMin : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMax : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanAR : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayStabw : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMedian : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanMM : num NA NA NA NA NA NA NA NA NA NA ...
$ T2m_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T2m_dayMin : num 15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
$ T2m_dayMax : num 25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
$ T2m_dayMeanAR : num 19 16.9 15.6 15.2 14.8 ...
$ T2m_dayStabw : num 3.409 2.048 1.334 1.726 0.965 ...
$ T2m_dayMedian : num 17.2 16.8 15.2 14.8 14.5 ...
$ T2m_dayMeanMM : num 20.6 16.9 15.9 15.4 15.2 ...
$ T10cm_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T10cm_dayMin : num 14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
$ T10cm_dayMax : num 27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
$ T10cm_dayMeanAR : num 18.7 16.5 14.9 15.1 14.5 ...
$ T10cm_dayStabw : num 4.36 2.84 1.73 2.36 1.54 ...
$ T10cm_dayMedian : num 16.1 15.6 14.3 14.2 14 ...
$ T10cm_dayMeanMM : num 21 16.8 15.8 15.4 15.1 ...
$ RF_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ RF_dayMin : num 45 58 73 56 68 62 63 44 65 58 ...
$ RF_dayMax : num 94 94 94 93 94 92 84 84 89 84 ...
$ RF_dayMean : num 68.6 76.3 78.9 74.4 86.5 ...
$ RF_dayStabw : num 17.09 12.53 5.88 9.83 5.62 ...
$ RF_dayMedian : num 64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
$ Luftdruck_dayMean : num 100.8 101 99.7 99.9 101.1 ...
$ es_day : num 2.53 1.95 1.82 1.78 1.74 ...
$ ea_day : num 1.57 1.42 1.49 1.27 1.38 ...
$ defi_day : num 0.956 0.535 0.327 0.509 0.355 ...
$ Nebel_dayAnteil : num 0 0 0 0 0 0 0 0 0 0 ...
$ Sonnenscheind_dayAnzahl: int 18 18 18 18 18 18 18 18 18 18 ...
$ Sonnenscheind_daySum : num 6.63 4.93 1.05 5.82 3.27 ...
$ julian_day : int 244 245 246 247 248 249 250 251 252 253 ...
$ zeta_day : num 2.81 2.82 2.84 2.86 2.88 ...
$ maxSonnenscheind : num 13.9 13.8 13.7 13.6 13.5 ...
$ R0_day : num 2920 2890 2860 2830 2799 ...
$ Globalstrahlung_dayMean: num NA NA NA NA NA NA NA NA NA NA ...
$ RG_day : num 13.24 11.19 6.64 12.02 9.03 ...
$ lambdaET_day : num 2.45 2.46 2.46 2.46 2.47 ...
$ sAnstieg_day : num 0.15 0.122 0.116 0.113 0.111 ...
$ gamma_day : num 0.067 0.0669 0.0659 0.0661 0.0668 ...
$ ETp_TW_day : num 2.71 2.15 1.28 2.24 1.68 ...
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Distanz_in_km : num 24 24 24 24 24 ...
$ North : num 53.5 53.5 53.5 53.5 53.5 ...
$ East : num 8.58 8.58 8.58 8.58 8.58 ...
$ Hoehe_in_m : num 7 7 7 7 7 7 7 7 7 7 ...
$ Kueste_km : num 20 20 20 20 20 ...
$ peatland : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ diffmaxt2m : num -1.6 0 -0.2 0.1 -0.4 ...
- attr(*, "sorted")= chr "Moor_ID"
- attr(*, ".internal.selfref")=<externalptr>
第二个data.table:
Second data.table:
str(meta)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 6 variables:
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Hoehe_Moor : num 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
$ Kueste_km : num 15.7 15.7 15.7 15.7 15.7 ...
$ WSPsommer_muGOK: num 0.699 0.699 0.699 0.699 0.699 ...
$ WSPwinter_muGOK: num 0.446 0.446 0.446 0.446 0.446 ...
$ Moorgroesse_km2: num 59 59 59 59 59 59 59 59 59 59 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "Moor_ID"
会话信息:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] grDevices datasets splines graphics stats tcltk utils methods base
other attached packages:
[1] reshape_0.8.4 plyr_1.7.1 data.table_1.8.0 svSocket_0.9-53 TinnR_1.0-5 R2HTML_2.2 Hmisc_3.9-3
[8] survival_2.36-14
loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1 lattice_0.20-6 svMisc_0.9-65 tools_2.15.1
推荐答案
我的错.问题是键不是唯一的:
My bad. The problem was that keys were not unique:
a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]
x y y.1
[1,] 1 1 3
[2,] 1 2 3
[3,] 1 1 4
[4,] 1 2 4
如果 data.table 能给出警告就好了.
It would be nice if data.table could give a warning for that.
来自马修的更新
此警告现已在 v1.8.7 中实施:
This warning has now been implemented in v1.8.7 :
新参数 allow.cartesian
(默认 FALSE
)添加到 X[Y]
和 merge(X,Y)代码>,#2464.防止由于指定错误的连接而导致大量分配;例如,将
Y
中的重复键值一遍又一遍地加入 X
中的同一组.当返回多于 max(nrow(X),nrow(Y))
行时,单词 cartesian 被松散地使用.错误消息很冗长,包括建议.
New argument
allow.cartesian
( defaultFALSE
) added toX[Y]
andmerge(X,Y)
, #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values inY
joining to the same group inX
over and over again. The word cartesian is used loosely for when more thanmax(nrow(X),nrow(Y))
rows would be returned. The error message is verbose and includes advice.
这篇关于合并 data.tables 使用超过 10 GB RAM的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!