合并data.tables使用超过10 GB的RAM [英] Merging data.tables uses more than 10 GB RAM

查看:122
本文介绍了合并data.tables使用超过10 GB的RAM的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个data.tables: DT meta 。当我使用 DT [meta] 合并它们时,内存使用增加了超过10 GB(并且合并非常慢)。出了什么问题?看起来像合并是成功的,但我只能看单行,否则我耗尽内存。 DT 本身是通过合并两个data.tables而没有任何问题创建的。



编辑:



这似乎是钥匙的问题。我可以做到以下没有问题:

  DT [,id:= 1:nrow meta [,id:= 1:nrow(DT)] 
setkey(DT,id)
setkey(meta,id)

DT2 <-DT来自Matthew Dowle:
#X [Y](或合并)在1:nrow(DT)的键只是一个cbind,不是吗?

unique(DT2 [,Moor_ID,with = F] == DT2 [,Moor_ID.1,with = F])
Moor_ID
[1, TRUE

第一个数据表:

  str(DT)
Classes'data.table'和'data.frame':10212 obs。的55个变量:
$ DWD_ID:chrBremerhavBremerhavBremerhavBremerhav...
$ numdays:int 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
$ days:日期,格式:2009-09-012009-09-022009-09-032009-09-04...
$ TBoden_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ TBoden_dayMin:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMax:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanAR:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayStabw:num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMedian:num NA NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanMM:num NA NA NA NA NA NA NA NA NA NA ...
$ T2m_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ T2m_dayMin:num 15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
$ T2m_dayMax:num 25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
$ T2m_dayMeanAR:num 19 16.9 15.6 15.2 14.8 ...
$ T2m_dayStabw:num 3.409 2.048 1.334 1.726 0.965 ...
$ T2m_dayMedian:num 17.2 16.8 15.2 14.8 14.5 ...
$ T2m_dayMeanMM:num 20.6 16.9 15.9 15.4 15.2 ...
$ T10cm_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ T10cm_dayMin:num 14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
$ T10cm_dayMax:num 27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
$ T10cm_dayMeanAR:num 18.7 16.5 14.9 15.1 14.5 ...
$ T10cm_dayStabw: num 4.36 2.84 1.73 2.36 1.54 ...
$ T10cm_dayMedian:num 16.1 15.6 14.3 14.2 14 ...
$ T10cm_dayMeanMM:num 21 16.8 15.8 15.4 15.1 ...
$ RF_dayAnzahl:int 0 0 0 0 0 0 0 0 0 0 ...
$ RF_dayMin:num 45 58 73 56 68 62 63 44 65 58 ...
$ RF_dayMax:num 94 94 94 93 94 92 84 84 89 84 ...
$ RF_dayMean:num 68.6 76.3 78.9 74.4 86.5 ...
$ RF_dayStabw:num 17.09 12.53 5.88 9.83 5.62 ...
$ RF_dayMedian:num 64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
$ Luftdruck_dayMean:num 100.8 101 99.7 99.9 101.1 ...
$ es_day:num 2.53 1.95 1.82 1.78 1.74 ...
$ ea_day:num 1.57 1.42 1.49 1.27 1.38 ...
$ defi_day:num 0.956 0.535 0.327 0.509 0.355 ...
$ Nebel_dayAnteil:num 0 0 0 0 0 0 0 0 0 0 ...
$ Sonnenscheind_dayAnzahl:int 18 18 18 18 18 18 18 18 18 18 ...
$ Sonnenscheind_daySum:num 6.63 4.93 1.05 5.82 3.27 ...
$ julian_day:int 244 245 246 247 248 249 250 251 252 253 ...
$ zeta_day:num 2.81 2.82 2.84 2.86 2.88 ...
$ maxSonnenscheind:num 13.9 13.8 13.7 13.6 13.5 ...
$ R0_day:num 2920 2890 2860 2830 2799 ...
$ Globalstrahlung_dayMean:num NA NA NA NA NA NA NA NA NA NA ...
$ RG_day:num 13.24 11.19 6.64 12.02 9.03 ...
$ lambdaET_day:num 2.45 2.46 2.46 2.46 2.47 .. 。
$ sAnstieg_day:num 0.15 0.122 0.116 0.113 0.111 ...
$ gamma_day:num 0.067 0.0669 0.0659 0.0661 0.0668 ...
$ ETp_TW_day:num 2.71 2.15 1.28 2.24 1.68 ...
$ Moor_ID:chrAhlenmoorAhlenmoorAhlenmoorAhlenmoor...
$ Distanz_in_km:num 24 24 24 24 24 ...
$ North:num 53.5 53.5 53.5 53.5 53.5 ...
$ East:num 8.58 8.58 8.58 8.58 8.58 ...
$ Hoehe_in_m:num 7 7 7 7 7 7 7 7 7 7 ...
$ Kueste_km:num 20 20 20 20 20 ...
$ peatland:logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ diffmaxt2m:num -1.6 0 -0.2 0.1 -0.4 ...
- attr *,sorted)= chrMoor_ID
-attr(*,.internal.selfref)=< externalptr>

第二个数据表:

  str(meta)
Classes'data.table'和'data.frame':10212 obs。的6个变量:
$ Moor_ID:chrAhlenmoorAhlenmoorAhlenmoorAhlenmoor...
$ Hoehe_Moor:num 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
$ Kueste_km:num 15.7 15.7 15.7 15.7 15.7 ...
$ WSPsommer_muGOK:num 0.699 0.699 0.699 0.699 0.699 ...
$ WSPwinter_muGOK:num 0.446 0.446 0.446 0.446 0.446 ...
$ Moorgroesse_km2:num 59 59 59 59 59 59 59 59 59 59 ...
- attr(*,.internal.selfref)=< externalptr>
- attr(*,sorted)= chrMoor_ID



会话信息:

  R版本2.15.1(2012-06-22)
平台:x86_64-pc-mingw32 / x64 64位)

locale:
[1] LC_COLLATE = German_Germany.1252 LC_CTYPE = German_Germany.1252 LC_MONETARY = German_Germany.1252 LC_NUMERIC = C
[5] LC_TIME = German_Germany。 1252

附加的基本包:
[1] grDevices datasets splines graphics stats tcltk utils方法base

其他附加包:
[1] reshape_0。 8.4 plyr_1.7.1 data.table_1.8.0 svSocket_0.9-53 TinnR_1.0-5 R2HTML_2.2 Hmisc_3.9-3
[8] survival_2.36-14

通过命名空间(并未附加):
[1] cluster_1.14.2 grid_2.15.1 lattice_0.20-6 svMisc_0.9-65 tools_2.15.1


解决方案

我的坏。问题是密钥不是唯一的:

  a <-data.table(x = c(1,1),y = c(1,2))
b <-data.table(x = c(1,1),y = c(3,4))
setkey(a,x)
setkey(b,x)
a [b]
xy y.1
[1,] 1 1 3
[2,] 1 2 3
[ ] 1 1 4
[4,] 1 2 4

.table可以给出警告。






从Matthew更新 b
$ b

此警告现已在v1.8.7中实施:


新参数 X [Y] 和<$ 中添加(默认 FALSE c $ c> merge(X,Y),#2464。防止由于错误的连接而导致的大量分配;例如, Y 中的重复键值重复加入 X 中的同一组。当大于 max(nrow(X),nrow(Y))的行将被返回时,笛卡儿字被宽松地使用。错误消息是详细的,并包含建议。



I have two data.tables: DT and meta. When I merge them using DT[meta], memory usage increases by more than 10 GB (and the merge is very slow). What's going wrong? It seems like the merge is successful, but I can only look at single lines, otherwise I run out of memory. DT itself was created by merging two data.tables without any problems.

Edit:

It seems to be a problem with the key. I can do the following without a problem:

DT[,id:=1:nrow(DT)]
meta[,id:=1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)

DT2<-DT[meta]   # Comment from Matthew Dowle:
                # X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it? 

unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
     Moor_ID
[1,]    TRUE

First data.table:

str(DT)
Classes ‘data.table’ and 'data.frame':  10212 obs. of  55 variables:
 $ DWD_ID                 : chr  "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
 $ numdays                : int  1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
 $ days                   : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
 $ TBoden_dayAnzahl       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ TBoden_dayMin          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMax          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMeanAR       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayStabw        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMedian       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMeanMM       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ T2m_dayAnzahl          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ T2m_dayMin             : num  15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
 $ T2m_dayMax             : num  25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
 $ T2m_dayMeanAR          : num  19 16.9 15.6 15.2 14.8 ...
 $ T2m_dayStabw           : num  3.409 2.048 1.334 1.726 0.965 ...
 $ T2m_dayMedian          : num  17.2 16.8 15.2 14.8 14.5 ...
 $ T2m_dayMeanMM          : num  20.6 16.9 15.9 15.4 15.2 ...
 $ T10cm_dayAnzahl        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ T10cm_dayMin           : num  14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
 $ T10cm_dayMax           : num  27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
 $ T10cm_dayMeanAR        : num  18.7 16.5 14.9 15.1 14.5 ...
 $ T10cm_dayStabw         : num  4.36 2.84 1.73 2.36 1.54 ...
 $ T10cm_dayMedian        : num  16.1 15.6 14.3 14.2 14 ...
 $ T10cm_dayMeanMM        : num  21 16.8 15.8 15.4 15.1 ...
 $ RF_dayAnzahl           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RF_dayMin              : num  45 58 73 56 68 62 63 44 65 58 ...
 $ RF_dayMax              : num  94 94 94 93 94 92 84 84 89 84 ...
 $ RF_dayMean             : num  68.6 76.3 78.9 74.4 86.5 ...
 $ RF_dayStabw            : num  17.09 12.53 5.88 9.83 5.62 ...
 $ RF_dayMedian           : num  64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
 $ Luftdruck_dayMean      : num  100.8 101 99.7 99.9 101.1 ...
 $ es_day                 : num  2.53 1.95 1.82 1.78 1.74 ...
 $ ea_day                 : num  1.57 1.42 1.49 1.27 1.38 ...
 $ defi_day               : num  0.956 0.535 0.327 0.509 0.355 ...
 $ Nebel_dayAnteil        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Sonnenscheind_dayAnzahl: int  18 18 18 18 18 18 18 18 18 18 ...
 $ Sonnenscheind_daySum   : num  6.63 4.93 1.05 5.82 3.27 ...
 $ julian_day             : int  244 245 246 247 248 249 250 251 252 253 ...
 $ zeta_day               : num  2.81 2.82 2.84 2.86 2.88 ...
 $ maxSonnenscheind       : num  13.9 13.8 13.7 13.6 13.5 ...
 $ R0_day                 : num  2920 2890 2860 2830 2799 ...
 $ Globalstrahlung_dayMean: num  NA NA NA NA NA NA NA NA NA NA ...
 $ RG_day                 : num  13.24 11.19 6.64 12.02 9.03 ...
 $ lambdaET_day           : num  2.45 2.46 2.46 2.46 2.47 ...
 $ sAnstieg_day           : num  0.15 0.122 0.116 0.113 0.111 ...
 $ gamma_day              : num  0.067 0.0669 0.0659 0.0661 0.0668 ...
 $ ETp_TW_day             : num  2.71 2.15 1.28 2.24 1.68 ...
 $ Moor_ID                : chr  "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
 $ Distanz_in_km          : num  24 24 24 24 24 ...
 $ North                  : num  53.5 53.5 53.5 53.5 53.5 ...
 $ East                   : num  8.58 8.58 8.58 8.58 8.58 ...
 $ Hoehe_in_m             : num  7 7 7 7 7 7 7 7 7 7 ...
 $ Kueste_km              : num  20 20 20 20 20 ...
 $ peatland               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ diffmaxt2m             : num  -1.6 0 -0.2 0.1 -0.4 ...
 - attr(*, "sorted")= chr "Moor_ID"
 - attr(*, ".internal.selfref")=<externalptr> 

Second data.table:

str(meta)
Classes ‘data.table’ and 'data.frame':  10212 obs. of  6 variables:
 $ Moor_ID        : chr  "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
 $ Hoehe_Moor     : num  2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
 $ Kueste_km      : num  15.7 15.7 15.7 15.7 15.7 ...
 $ WSPsommer_muGOK: num  0.699 0.699 0.699 0.699 0.699 ...
 $ WSPwinter_muGOK: num  0.446 0.446 0.446 0.446 0.446 ...
 $ Moorgroesse_km2: num  59 59 59 59 59 59 59 59 59 59 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "Moor_ID"

Session info:

R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] grDevices datasets  splines   graphics  stats     tcltk     utils     methods   base     

other attached packages:
[1] reshape_0.8.4    plyr_1.7.1       data.table_1.8.0 svSocket_0.9-53  TinnR_1.0-5      R2HTML_2.2       Hmisc_3.9-3     
[8] survival_2.36-14

loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1    lattice_0.20-6 svMisc_0.9-65  tools_2.15.1 

解决方案

My bad. The problem was that keys were not unique:

a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]
     x y y.1
[1,] 1 1   3
[2,] 1 2   3
[3,] 1 1   4
[4,] 1 2   4

It would be nice if data.table could give a warning for that.


Update from Matthew

This warning has now been implemented in v1.8.7 :

New argument allow.cartesian ( default FALSE) added to X[Y] and merge(X,Y), #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values in Y joining to the same group in X over and over again. The word cartesian is used loosely for when more than max(nrow(X),nrow(Y)) rows would be returned. The error message is verbose and includes advice.

这篇关于合并data.tables使用超过10 GB的RAM的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆