合并data.tables使用超过10 GB的RAM [英] Merging data.tables uses more than 10 GB RAM
问题描述
我有两个data.tables: DT
和 meta
。当我使用 DT [meta]
合并它们时,内存使用增加了超过10 GB(并且合并非常慢)。出了什么问题?看起来像合并是成功的,但我只能看单行,否则我耗尽内存。 DT
本身是通过合并两个data.tables而没有任何问题创建的。
编辑:
这似乎是钥匙的问题。我可以做到以下没有问题:
DT [,id:= 1:nrow meta [,id:= 1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)
DT2 <-DT来自Matthew Dowle:
#X [Y](或合并)在1:nrow(DT)的键只是一个cbind,不是吗?
unique(DT2 [,Moor_ID,with = F] == DT2 [,Moor_ID.1,with = F])
Moor_ID
[1, TRUE
第一个数据表:
str(DT)
Classes'data.table'和'data.frame':10212 obs。的55个变量:
$ DWD_ID:chrBremerhavBremerhavBremerhavBremerhav...
$ numdays:int 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
$ days:日期,格式:2009-09-012009-09-022009-09-032009-09-04...
$ TBoden_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ TBoden_dayMin:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMax:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanAR:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayStabw:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMedian:num NA NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanMM:num NA NA NA NA NA NA NA NA NA NA ...
$ T2m_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ T2m_dayMin:num 15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
$ T2m_dayMax:num 25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
$ T2m_dayMeanAR:num 19 16.9 15.6 15.2 14.8 ...
$ T2m_dayStabw:num 3.409 2.048 1.334 1.726 0.965 ...
$ T2m_dayMedian:num 17.2 16.8 15.2 14.8 14.5 ...
$ T2m_dayMeanMM:num 20.6 16.9 15.9 15.4 15.2 ...
$ T10cm_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ T10cm_dayMin:num 14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
$ T10cm_dayMax:num 27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
$ T10cm_dayMeanAR:num 18.7 16.5 14.9 15.1 14.5 ...
$ T10cm_dayStabw: num 4.36 2.84 1.73 2.36 1.54 ...
$ T10cm_dayMedian:num 16.1 15.6 14.3 14.2 14 ...
$ T10cm_dayMeanMM:num 21 16.8 15.8 15.4 15.1 ...
$ RF_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ RF_dayMin:num 45 58 73 56 68 62 63 44 65 58 ...
$ RF_dayMax:num 94 94 94 93 94 92 84 84 89 84 ...
$ RF_dayMean:num 68.6 76.3 78.9 74.4 86.5 ...
$ RF_dayStabw:num 17.09 12.53 5.88 9.83 5.62 ...
$ RF_dayMedian:num 64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
$ Luftdruck_dayMean:num 100.8 101 99.7 99.9 101.1 ...
$ es_day:num 2.53 1.95 1.82 1.78 1.74 ...
$ ea_day:num 1.57 1.42 1.49 1.27 1.38 ...
$ defi_day:num 0.956 0.535 0.327 0.509 0.355 ...
$ Nebel_dayAnteil:num 0 0 0 0 0 0 0 0 0 0 ...
$ Sonnenscheind_dayAnzahl:int 18 18 18 18 18 18 18 18 18 18 ...
$ Sonnenscheind_daySum:num 6.63 4.93 1.05 5.82 3.27 ...
$ julian_day:int 244 245 246 247 248 249 250 251 252 253 ...
$ zeta_day:num 2.81 2.82 2.84 2.86 2.88 ...
$ maxSonnenscheind:num 13.9 13.8 13.7 13.6 13.5 ...
$ R0_day:num 2920 2890 2860 2830 2799 ...
$ Globalstrahlung_dayMean:num NA NA NA NA NA NA NA NA NA NA ...
$ RG_day:num 13.24 11.19 6.64 12.02 9.03 ...
$ lambdaET_day:num 2.45 2.46 2.46 2.46 2.47 .. 。
$ sAnstieg_day:num 0.15 0.122 0.116 0.113 0.111 ...
$ gamma_day:num 0.067 0.0669 0.0659 0.0661 0.0668 ...
$ ETp_TW_day:num 2.71 2.15 1.28 2.24 1.68 ...
$ Moor_ID:chrAhlenmoorAhlenmoorAhlenmoorAhlenmoor...
$ Distanz_in_km:num 24 24 24 24 24 ...
$ North:num 53.5 53.5 53.5 53.5 53.5 ...
$ East:num 8.58 8.58 8.58 8.58 8.58 ...
$ Hoehe_in_m:num 7 7 7 7 7 7 7 7 7 7 ...
$ Kueste_km:num 20 20 20 20 20 ...
$ peatland:logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ diffmaxt2m:num -1.6 0 -0.2 0.1 -0.4 ...
- attr *,sorted)= chrMoor_ID
-attr(*,.internal.selfref)=< externalptr>
第二个数据表:
str(meta)
Classes'data.table'和'data.frame':10212 obs。的6个变量:
$ Moor_ID:chrAhlenmoorAhlenmoorAhlenmoorAhlenmoor...
$ Hoehe_Moor:num 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
$ Kueste_km:num 15.7 15.7 15.7 15.7 15.7 ...
$ WSPsommer_muGOK:num 0.699 0.699 0.699 0.699 0.699 ...
$ WSPwinter_muGOK:num 0.446 0.446 0.446 0.446 0.446 ...
$ Moorgroesse_km2:num 59 59 59 59 59 59 59 59 59 59 ...
- attr(*,.internal.selfref)=< externalptr>
- attr(*,sorted)= chrMoor_ID
会话信息:
R版本2.15.1(2012-06-22)
平台:x86_64-pc-mingw32 / x64 64位)
locale:
[1] LC_COLLATE = German_Germany.1252 LC_CTYPE = German_Germany.1252 LC_MONETARY = German_Germany.1252 LC_NUMERIC = C
[5] LC_TIME = German_Germany。 1252
附加的基本包:
[1] grDevices datasets splines graphics stats tcltk utils方法base
其他附加包:
[1] reshape_0。 8.4 plyr_1.7.1 data.table_1.8.0 svSocket_0.9-53 TinnR_1.0-5 R2HTML_2.2 Hmisc_3.9-3
[8] survival_2.36-14
通过命名空间(并未附加):
[1] cluster_1.14.2 grid_2.15.1 lattice_0.20-6 svMisc_0.9-65 tools_2.15.1
我的坏。问题是密钥不是唯一的:
a <-data.table(x = c(1,1),y = c(1,2))
b <-data.table(x = c(1,1),y = c(3,4))
setkey(a,x)
setkey(b,x)
a [b]
xy y.1
[1,] 1 1 3
[2,] 1 2 3
[ ] 1 1 4
[4,] 1 2 4
.table可以给出警告。
从Matthew更新 b
$ b
此警告现已在v1.8.7中实施:
新参数
X [Y]
和<$中添加
,#2464。防止由于错误的连接而导致的大量分配;例如,(默认 FALSE
c $ c> merge(X,Y)Y
中的重复键值重复加入X
中的同一组。当大于max(nrow(X),nrow(Y))
的行将被返回时,笛卡儿字被宽松地使用。错误消息是详细的,并包含建议。
I have two data.tables: DT
and meta
. When I merge them using DT[meta]
, memory usage increases by more than 10 GB (and the merge is very slow). What's going wrong? It seems like the merge is successful, but I can only look at single lines, otherwise I run out of memory. DT
itself was created by merging two data.tables without any problems.
Edit:
It seems to be a problem with the key. I can do the following without a problem:
DT[,id:=1:nrow(DT)]
meta[,id:=1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)
DT2<-DT[meta] # Comment from Matthew Dowle:
# X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it?
unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
Moor_ID
[1,] TRUE
First data.table:
str(DT)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 55 variables:
$ DWD_ID : chr "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
$ numdays : int 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
$ days : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
$ TBoden_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ TBoden_dayMin : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMax : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanAR : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayStabw : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMedian : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanMM : num NA NA NA NA NA NA NA NA NA NA ...
$ T2m_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T2m_dayMin : num 15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
$ T2m_dayMax : num 25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
$ T2m_dayMeanAR : num 19 16.9 15.6 15.2 14.8 ...
$ T2m_dayStabw : num 3.409 2.048 1.334 1.726 0.965 ...
$ T2m_dayMedian : num 17.2 16.8 15.2 14.8 14.5 ...
$ T2m_dayMeanMM : num 20.6 16.9 15.9 15.4 15.2 ...
$ T10cm_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T10cm_dayMin : num 14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
$ T10cm_dayMax : num 27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
$ T10cm_dayMeanAR : num 18.7 16.5 14.9 15.1 14.5 ...
$ T10cm_dayStabw : num 4.36 2.84 1.73 2.36 1.54 ...
$ T10cm_dayMedian : num 16.1 15.6 14.3 14.2 14 ...
$ T10cm_dayMeanMM : num 21 16.8 15.8 15.4 15.1 ...
$ RF_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ RF_dayMin : num 45 58 73 56 68 62 63 44 65 58 ...
$ RF_dayMax : num 94 94 94 93 94 92 84 84 89 84 ...
$ RF_dayMean : num 68.6 76.3 78.9 74.4 86.5 ...
$ RF_dayStabw : num 17.09 12.53 5.88 9.83 5.62 ...
$ RF_dayMedian : num 64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
$ Luftdruck_dayMean : num 100.8 101 99.7 99.9 101.1 ...
$ es_day : num 2.53 1.95 1.82 1.78 1.74 ...
$ ea_day : num 1.57 1.42 1.49 1.27 1.38 ...
$ defi_day : num 0.956 0.535 0.327 0.509 0.355 ...
$ Nebel_dayAnteil : num 0 0 0 0 0 0 0 0 0 0 ...
$ Sonnenscheind_dayAnzahl: int 18 18 18 18 18 18 18 18 18 18 ...
$ Sonnenscheind_daySum : num 6.63 4.93 1.05 5.82 3.27 ...
$ julian_day : int 244 245 246 247 248 249 250 251 252 253 ...
$ zeta_day : num 2.81 2.82 2.84 2.86 2.88 ...
$ maxSonnenscheind : num 13.9 13.8 13.7 13.6 13.5 ...
$ R0_day : num 2920 2890 2860 2830 2799 ...
$ Globalstrahlung_dayMean: num NA NA NA NA NA NA NA NA NA NA ...
$ RG_day : num 13.24 11.19 6.64 12.02 9.03 ...
$ lambdaET_day : num 2.45 2.46 2.46 2.46 2.47 ...
$ sAnstieg_day : num 0.15 0.122 0.116 0.113 0.111 ...
$ gamma_day : num 0.067 0.0669 0.0659 0.0661 0.0668 ...
$ ETp_TW_day : num 2.71 2.15 1.28 2.24 1.68 ...
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Distanz_in_km : num 24 24 24 24 24 ...
$ North : num 53.5 53.5 53.5 53.5 53.5 ...
$ East : num 8.58 8.58 8.58 8.58 8.58 ...
$ Hoehe_in_m : num 7 7 7 7 7 7 7 7 7 7 ...
$ Kueste_km : num 20 20 20 20 20 ...
$ peatland : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ diffmaxt2m : num -1.6 0 -0.2 0.1 -0.4 ...
- attr(*, "sorted")= chr "Moor_ID"
- attr(*, ".internal.selfref")=<externalptr>
Second data.table:
str(meta)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 6 variables:
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Hoehe_Moor : num 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
$ Kueste_km : num 15.7 15.7 15.7 15.7 15.7 ...
$ WSPsommer_muGOK: num 0.699 0.699 0.699 0.699 0.699 ...
$ WSPwinter_muGOK: num 0.446 0.446 0.446 0.446 0.446 ...
$ Moorgroesse_km2: num 59 59 59 59 59 59 59 59 59 59 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "Moor_ID"
Session info:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] grDevices datasets splines graphics stats tcltk utils methods base
other attached packages:
[1] reshape_0.8.4 plyr_1.7.1 data.table_1.8.0 svSocket_0.9-53 TinnR_1.0-5 R2HTML_2.2 Hmisc_3.9-3
[8] survival_2.36-14
loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1 lattice_0.20-6 svMisc_0.9-65 tools_2.15.1
My bad. The problem was that keys were not unique:
a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]
x y y.1
[1,] 1 1 3
[2,] 1 2 3
[3,] 1 1 4
[4,] 1 2 4
It would be nice if data.table could give a warning for that.
Update from Matthew
This warning has now been implemented in v1.8.7 :
New argument
allow.cartesian
( defaultFALSE
) added toX[Y]
andmerge(X,Y)
, #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values inY
joining to the same group inX
over and over again. The word cartesian is used loosely for when more thanmax(nrow(X),nrow(Y))
rows would be returned. The error message is verbose and includes advice.
这篇关于合并data.tables使用超过10 GB的RAM的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!