Merging data.tables uses more than 10 GB RAM


Question

I have two data.tables: DT and meta. When I merge them using DT[meta], memory usage increases by more than 10 GB (and the merge is very slow). What's going wrong? It seems like the merge is successful, but I can only look at single lines, otherwise I run out of memory. DT itself was created by merging two data.tables without any problems.

It seems to be a problem with the key. I can do the following without a problem:

DT[,id:=1:nrow(DT)]     # add a unique row id as the key in each table
meta[,id:=1:nrow(DT)]   # (nrow(DT) == nrow(meta) == 10212 here)
setkey(DT,id)
setkey(meta,id)

DT2<-DT[meta]   # Comment from Matthew Dowle:
                # X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it? 

unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
     Moor_ID
[1,]    TRUE
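
With hindsight from the accepted answer below, the difference between the id join above and the original Moor_ID join can be made visible with a quick uniqueness check. This diagnostic is not part of the original question, just a sketch:

any(duplicated(DT$id))         # FALSE: the 1:nrow id key is unique, so each row matches exactly one row
any(duplicated(DT$Moor_ID))    # TRUE: many rows share each Moor_ID value
any(duplicated(meta$Moor_ID))  # TRUE on both sides, so a join on Moor_ID multiplies the matches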

First data.table:

str(DT)
Classes ‘data.table’ and 'data.frame':  10212 obs. of  55 variables:
 $ DWD_ID                 : chr  "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
 $ numdays                : int  1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
 $ days                   : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
 $ TBoden_dayAnzahl       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ TBoden_dayMin          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMax          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMeanAR       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayStabw        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMedian       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ TBoden_dayMeanMM       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ T2m_dayAnzahl          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ T2m_dayMin             : num  15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
 $ T2m_dayMax             : num  25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
 $ T2m_dayMeanAR          : num  19 16.9 15.6 15.2 14.8 ...
 $ T2m_dayStabw           : num  3.409 2.048 1.334 1.726 0.965 ...
 $ T2m_dayMedian          : num  17.2 16.8 15.2 14.8 14.5 ...
 $ T2m_dayMeanMM          : num  20.6 16.9 15.9 15.4 15.2 ...
 $ T10cm_dayAnzahl        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ T10cm_dayMin           : num  14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
 $ T10cm_dayMax           : num  27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
 $ T10cm_dayMeanAR        : num  18.7 16.5 14.9 15.1 14.5 ...
 $ T10cm_dayStabw         : num  4.36 2.84 1.73 2.36 1.54 ...
 $ T10cm_dayMedian        : num  16.1 15.6 14.3 14.2 14 ...
 $ T10cm_dayMeanMM        : num  21 16.8 15.8 15.4 15.1 ...
 $ RF_dayAnzahl           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ RF_dayMin              : num  45 58 73 56 68 62 63 44 65 58 ...
 $ RF_dayMax              : num  94 94 94 93 94 92 84 84 89 84 ...
 $ RF_dayMean             : num  68.6 76.3 78.9 74.4 86.5 ...
 $ RF_dayStabw            : num  17.09 12.53 5.88 9.83 5.62 ...
 $ RF_dayMedian           : num  64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
 $ Luftdruck_dayMean      : num  100.8 101 99.7 99.9 101.1 ...
 $ es_day                 : num  2.53 1.95 1.82 1.78 1.74 ...
 $ ea_day                 : num  1.57 1.42 1.49 1.27 1.38 ...
 $ defi_day               : num  0.956 0.535 0.327 0.509 0.355 ...
 $ Nebel_dayAnteil        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Sonnenscheind_dayAnzahl: int  18 18 18 18 18 18 18 18 18 18 ...
 $ Sonnenscheind_daySum   : num  6.63 4.93 1.05 5.82 3.27 ...
 $ julian_day             : int  244 245 246 247 248 249 250 251 252 253 ...
 $ zeta_day               : num  2.81 2.82 2.84 2.86 2.88 ...
 $ maxSonnenscheind       : num  13.9 13.8 13.7 13.6 13.5 ...
 $ R0_day                 : num  2920 2890 2860 2830 2799 ...
 $ Globalstrahlung_dayMean: num  NA NA NA NA NA NA NA NA NA NA ...
 $ RG_day                 : num  13.24 11.19 6.64 12.02 9.03 ...
 $ lambdaET_day           : num  2.45 2.46 2.46 2.46 2.47 ...
 $ sAnstieg_day           : num  0.15 0.122 0.116 0.113 0.111 ...
 $ gamma_day              : num  0.067 0.0669 0.0659 0.0661 0.0668 ...
 $ ETp_TW_day             : num  2.71 2.15 1.28 2.24 1.68 ...
 $ Moor_ID                : chr  "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
 $ Distanz_in_km          : num  24 24 24 24 24 ...
 $ North                  : num  53.5 53.5 53.5 53.5 53.5 ...
 $ East                   : num  8.58 8.58 8.58 8.58 8.58 ...
 $ Hoehe_in_m             : num  7 7 7 7 7 7 7 7 7 7 ...
 $ Kueste_km              : num  20 20 20 20 20 ...
 $ peatland               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ diffmaxt2m             : num  -1.6 0 -0.2 0.1 -0.4 ...
 - attr(*, "sorted")= chr "Moor_ID"
 - attr(*, ".internal.selfref")=<externalptr> 

Second data.table:

str(meta)
Classes ‘data.table’ and 'data.frame':  10212 obs. of  6 variables:
 $ Moor_ID        : chr  "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
 $ Hoehe_Moor     : num  2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
 $ Kueste_km      : num  15.7 15.7 15.7 15.7 15.7 ...
 $ WSPsommer_muGOK: num  0.699 0.699 0.699 0.699 0.699 ...
 $ WSPwinter_muGOK: num  0.446 0.446 0.446 0.446 0.446 ...
 $ Moorgroesse_km2: num  59 59 59 59 59 59 59 59 59 59 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "Moor_ID"

Session info:

R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] grDevices datasets  splines   graphics  stats     tcltk     utils     methods   base     

other attached packages:
[1] reshape_0.8.4    plyr_1.7.1       data.table_1.8.0 svSocket_0.9-53  TinnR_1.0-5      R2HTML_2.2       Hmisc_3.9-3     
[8] survival_2.36-14

loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1    lattice_0.20-6 svMisc_0.9-65  tools_2.15.1 

Answer

My bad. The problem was that keys were not unique:

a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]    # key value 1 is duplicated in both tables: each row of b matches both rows of a
     x y y.1
[1,] 1 1   3
[2,] 1 2   3
[3,] 1 1   4
[4,] 1 2   4
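
The blow-up in the real data can be estimated up front from the per-key row counts: for every key value, the join returns (rows in DT) * (rows in meta) matching rows. A rough size check in base R, as a sketch rather than part of the original answer:

nx <- table(DT$Moor_ID)    # rows per Moor_ID in DT
ny <- table(meta$Moor_ID)  # rows per Moor_ID in meta
# expected nrow(DT[meta]), assuming every Moor_ID in meta also occurs in DT
sum(as.numeric(nx[names(ny)]) * as.numeric(ny))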

It would be nice if data.table could give a warning for that.

Update from Matthew

This warning has now been implemented in v1.8.7:

New argument allow.cartesian (default FALSE) added to X[Y] and merge(X,Y), #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values in Y joining to the same group in X over and over again. The word cartesian is used loosely for when more than max(nrow(X),nrow(Y)) rows would be returned. The error message is verbose and includes advice.
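
For reference, a minimal sketch of how the guard behaves (based on the NEWS entry above; the exact error text and row-count threshold may differ between data.table versions):

library(data.table)
a <- data.table(x = rep(1L, 3), y = 1:3, key = "x")
b <- data.table(x = rep(1L, 3), z = 4:6, key = "x")
# a[b]                          # errors: the join would return 9 rows and the
#                               # message suggests setting allow.cartesian=TRUE
a[b, allow.cartesian = TRUE]    # explicitly permit the 9-row many-to-many join
merge(a, b, by = "x", allow.cartesian = TRUE)   # the same guard applies to merge()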
