data.table join(在vecseq中的错误)是关键的必要在X和i? [英] data.table join (Error in vecseq) is key necessary on both on X and i?

查看:875
本文介绍了data.table join(在vecseq中的错误)是关键的必要在X和i?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是新的R和 data.table ,我觉得有用和快速。我想加入2个数据表:

 > TotFreq 
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType
1:113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM
2:113802 1 2013-03-24 000000000126 1.104030e -03 0 47.50 FM
3:113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM
4:113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM
5:113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM
160167:113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM
160168:113802 7 2013-03 -23 978141970584 1.770796e-06 25 1.00 SM
160169:113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM
160170:113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM
160171:113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM

> Count_SM_FM
Legacy_Store_Number WeekDay ItemType物件
1:113802 1 SM 12305
2:113802 1 FM 1942
3:113802 2 SM 11014
4:113802 2 FM 1398
5:113802 3 SM 10154
6:113802 3 FM 1117
7:113802 4 SM 10414
8:113802 4 FM 1167
9:113802 5 SM 10258
10:113802 5 FM 1200
11:113802 6 SM 11116
12:113802 6 FM 1575
13:113802 7 SM 13098
14:113802 7 FM 2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
>
> ResultJoin< - TotFreq [Count_SM_FM]
在vecseq中出错(f__,len__,if(allow.cartesian)NULL else as.integer(max(nrow(x),:
加入结果为320342行;超过160171 = max(nrow(x),nrow(i))。检查i中的重复键值,每次重复加入x中的同一个组,如果可以,尝试包括`j`如果你确定你想继续,重新运行allow.cartesian = TRUE,否则,请搜索这个错误消息

但是我没有重复的键 i !



使用:

 > ResultJoin <-TotFreq [Count_SM_FM,allow.cartesian = T] 
>
> ResultJoin
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType。 1目录
1:113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305
2:113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305
3:113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305
4:113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305
---
320338:113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326
320339:113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326
320340:113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326
320341:113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326
320342:113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326

我实际上是原来的 TotFreq 表中的记录的两倍。如果我在 Count_SM_FM 上添加键,则连接正常工作:

  ; setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin< - TotFreq [Count_SM_FM]
>
> ResultJoin
Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems
1:113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942
2:113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942
3:113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942
4:113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942
5:113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942
---
160167:113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098
160168:113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098
160169:113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098
160170:113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098
160171:113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098

我试图验证一个例子,也许问题是没有关键变量作为 TotFreq 或没有 Count_SM_FM 未排序,但无法重现此错误

 > daysType<  -  data.table(
+ key1 = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2 = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+ key3 = c(b,a a,b,a,b,a,b,a,b,a,b
+ var1 = c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+)
>
>
> detailData< - data.table(
+ key1 = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1),
+ key2 = c(1,1,1,1,1,1,1,2,2,2,2,2,2 ,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6 ,6,6,6,7,7,7,7,7,7,7,7),
+ var2 = c(10,11,12,13,15,16,17,10,11 ,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16 ,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+ var3 = c(1,2,4,6,6 ,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7 ,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+ key3 = c( a,a,a,a,b,b,b,a ,a,a,a,a,b,b,a b,b,a,a,a,a,b,b ,a,b,b,b,b)
+ )
>
> setkey(detailData,key1,key2,key3)
> JoinResult < - detailData [daysType]

问题与问题不同


加入两个data.tables失败


allow.cartesian 解决了问题。



这里有什么问题?为什么要添加键到 Count_SM_FM 可以解决它?



谢谢!

解决方案

/ strong> Arun在v1.9.5中修复:


allow.cartesian 现在在 i 没有重复项时被忽略, #742 #508 。感谢@nigmastar,@ user3645882和其他报告。










上一个答案...



首先让我们来处理 allow.cartesian 部分。错误消息应该可以改变,指出你可以得到大尺寸,即使你没有在 i 中的重复,但你有左侧的重复 data.table 。这里有一个简单的例子:

  dt1 = data.table(a = c(1,1),b = key ='a')
dt2 = data.table(a = c(1,2),c = 3:4)

dt1 [dt2]#这给出一个错误,因为连接结果为3行,如下所示

dt1 [dt2,allow.cartesian = TRUE]
#abc
#1:1 1 3
#2: 1 2 3
#3:2 NA 4

现在, - 不需要为 i 设置键,它将假定前几列是键。查看您的第一个加入结果,可以看到已加入 ItemType ,并且您使用的是旧的 data.table version(我使用的是1.9.3)。所以我的猜测是,或者你没有真正设置的键正确,并没有包括 ItemType 或者有一些错误在旧版本,自那以后被修复。 p>

I am new to R and to data.table, which I find useful and fast. I am trying to join 2 data tables:

> TotFreq
        Legacy_Store_Number WeekDay       Date     Item_Key        Distr NoSellingDays meanUnits ItemType
     1:              113802       1 2013-03-24 000000000120 2.428985e-04             0      8.00       FM
     2:              113802       1 2013-03-24 000000000126 1.104030e-03             0     47.50       FM
     3:              113802       1 2013-03-24 000000000170 1.126004e-03             0     48.75       FM
     4:              113802       1 2013-03-24 000000000180 5.143034e-04             0     19.00       FM
     5:              113802       1 2013-03-24 000000000260 3.854306e-04             0     12.25       FM
160167:              113802       7 2013-03-23 978125002327 5.902655e-07            27      1.00       SM
160168:              113802       7 2013-03-23 978141970584 1.770796e-06            25      1.00       SM
160169:              113802       7 2013-03-23 978145300697 1.180531e-06            26      1.00       SM
160170:              113802       7 2013-03-23 978145552558 5.902655e-07            27      1.00       SM
160171:              113802       7 2013-03-23 978160139536 5.902655e-07            27      1.00       SM

> Count_SM_FM
    Legacy_Store_Number WeekDay ItemType ObjItems
 1:              113802       1       SM    12305
 2:              113802       1       FM     1942
 3:              113802       2       SM    11014
 4:              113802       2       FM     1398
 5:              113802       3       SM    10154
 6:              113802       3       FM     1117
 7:              113802       4       SM    10414
 8:              113802       4       FM     1167
 9:              113802       5       SM    10258
10:              113802       5       FM     1200
11:              113802       6       SM    11116
12:              113802       6       FM     1575
13:              113802       7       SM    13098
14:              113802       7       FM     2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> 
> ResultJoin <- TotFreq[Count_SM_FM]
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

But I do not have duplicate keys in i!

Using:

> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]
> 
> ResultJoin
        Legacy_Store_Number WeekDay       Date     Item_Key        Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems
     1:              113802       1 2013-03-24 000000000120 2.428985e-04             0      8.00       FM         SM    12305
     2:              113802       1 2013-03-24 000000000126 1.104030e-03             0     47.50       FM         SM    12305
     3:              113802       1 2013-03-24 000000000170 1.126004e-03             0     48.75       FM         SM    12305
     4:              113802       1 2013-03-24 000000000180 5.143034e-04             0     19.00       FM         SM    12305
     5:              113802       1 2013-03-24 000000000260 3.854306e-04             0     12.25       FM         SM    12305
    ---                                                                                                                      
320338:              113802       7 2013-03-23 978125002327 5.902655e-07            27      1.00       SM         FM     2326
320339:              113802       7 2013-03-23 978141970584 1.770796e-06            25      1.00       SM         FM     2326
320340:              113802       7 2013-03-23 978145300697 1.180531e-06            26      1.00       SM         FM     2326
320341:              113802       7 2013-03-23 978145552558 5.902655e-07            27      1.00       SM         FM     2326
320342:              113802       7 2013-03-23 978160139536 5.902655e-07            27      1.00       SM         FM     2326

I get in fact double the records that I had in my original TotFreq table. If I add a key also on Count_SM_FM the join works:

> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin <- TotFreq[Count_SM_FM]
> 
> ResultJoin
        Legacy_Store_Number WeekDay ItemType       Date     Item_Key        Distr NoSellingDays meanUnits ObjItems
     1:              113802       1       FM 2013-03-24 000000000120 2.428985e-04             0      8.00     1942
     2:              113802       1       FM 2013-03-24 000000000126 1.104030e-03             0     47.50     1942
     3:              113802       1       FM 2013-03-24 000000000170 1.126004e-03             0     48.75     1942
     4:              113802       1       FM 2013-03-24 000000000180 5.143034e-04             0     19.00     1942
     5:              113802       1       FM 2013-03-24 000000000260 3.854306e-04             0     12.25     1942
    ---                                                                                                           
160167:              113802       7       SM 2013-03-23 978125002327 5.902655e-07            27      1.00    13098
160168:              113802       7       SM 2013-03-23 978141970584 1.770796e-06            25      1.00    13098
160169:              113802       7       SM 2013-03-23 978145300697 1.180531e-06            26      1.00    13098
160170:              113802       7       SM 2013-03-23 978145552558 5.902655e-07            27      1.00    13098
160171:              113802       7       SM 2013-03-23 978160139536 5.902655e-07            27      1.00    13098

I tried to verify with an example, maybe the problem was in not having the key variables as the first columns in TotFreq or in not having Count_SM_FM not sorted but I couldn't reproduce the error

> daysType <- data.table(
+     key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+     key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+     key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),
+     var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+ )        
> 
> 
> detailData <- data.table(
+     key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+     key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
+     var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+     var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+     key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")
+ )        
> 
> setkey(detailData,key1,key2,key3)
> JoinResult <- detailData[daysType]

The problem is different than that in question

Join of two data.tables fails

because there the allow.cartesian solves the issue.

What is the problem here? Why adding the key to Count_SM_FM solves it?

Thanks!

解决方案

Update Oct 2014: Arun fixed it in v1.9.5 :

allow.cartesian is now ignored when i has no duplicates, #742 and #508. Thanks to @nigmastar, @user3645882 and others for the reports.



Previous answer ...

First let's address the allow.cartesian part. The error message should probably be changed to point out that you can get large sizes even if you don't have duplicates in i, but you have duplicates in the left hand side data.table. Here's a simple example:

dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')
dt2 = data.table(a = c(1,2), c = 3:4)

dt1[dt2] # this gives an error, because join results in 3 rows, as seen below

dt1[dt2, allow.cartesian = TRUE]
#   a  b c
#1: 1  1 3
#2: 1  2 3
#3: 2 NA 4

Now as far as setting the key goes - no you don't need to set the key for i, it will just assume the first few columns are the keys. Looking at your first join result one can see that it was not joined on ItemType and that you're using an older data.table version (I'm using 1.9.3). So my guess is that either you didn't actually set the key correctly and didn't include ItemType or there was some bug in older versions that's been fixed since then.

这篇关于data.table join(在vecseq中的错误)是关键的必要在X和i?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆