data.table join(在vecseq中的错误)是关键的必要在X和i? [英] data.table join (Error in vecseq) is key necessary on both on X and i?
问题描述
我是新的R和 data.table
,我觉得有用和快速。我想加入2个数据表:
> TotFreq
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType
1:113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM
2:113802 1 2013-03-24 000000000126 1.104030e -03 0 47.50 FM
3:113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM
4:113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM
5:113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM
160167:113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM
160168:113802 7 2013-03 -23 978141970584 1.770796e-06 25 1.00 SM
160169:113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM
160170:113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM
160171:113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM
> Count_SM_FM
Legacy_Store_Number WeekDay ItemType物件
1:113802 1 SM 12305
2:113802 1 FM 1942
3:113802 2 SM 11014
4:113802 2 FM 1398
5:113802 3 SM 10154
6:113802 3 FM 1117
7:113802 4 SM 10414
8:113802 4 FM 1167
9:113802 5 SM 10258
10:113802 5 FM 1200
11:113802 6 SM 11116
12:113802 6 FM 1575
13:113802 7 SM 13098
14:113802 7 FM 2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
>
> ResultJoin< - TotFreq [Count_SM_FM]
在vecseq中出错(f__,len__,if(allow.cartesian)NULL else as.integer(max(nrow(x),:
加入结果为320342行;超过160171 = max(nrow(x),nrow(i))。检查i中的重复键值,每次重复加入x中的同一个组,如果可以,尝试包括`j`如果你确定你想继续,重新运行allow.cartesian = TRUE,否则,请搜索这个错误消息
但是我没有重复的键 i !
使用:
> ResultJoin <-TotFreq [Count_SM_FM,allow.cartesian = T]
>
> ResultJoin
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType。 1目录
1:113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305
2:113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305
3:113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305
4:113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305
---
320338:113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326
320339:113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326
320340:113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326
320341:113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326
320342:113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326
我实际上是原来的 TotFreq
表中的记录的两倍。如果我在 Count_SM_FM
上添加键,则连接正常工作:
; setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin< - TotFreq [Count_SM_FM]
>
> ResultJoin
Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems
1:113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942
2:113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942
3:113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942
4:113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942
5:113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942
---
160167:113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098
160168:113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098
160169:113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098
160170:113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098
160171:113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098
我试图验证一个例子,也许问题是没有关键变量作为 TotFreq
或没有 Count_SM_FM
未排序,但无法重现此错误
> daysType< - data.table(
+ key1 = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2 = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+ key3 = c(b,a a,b,a,b,a,b,a,b,a,b
+ var1 = c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+)
>
>
> detailData< - data.table(
+ key1 = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1),
+ key2 = c(1,1,1,1,1,1,1,2,2,2,2,2,2 ,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6 ,6,6,6,7,7,7,7,7,7,7,7),
+ var2 = c(10,11,12,13,15,16,17,10,11 ,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16 ,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+ var3 = c(1,2,4,6,6 ,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7 ,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+ key3 = c( a,a,a,a,b,b,b,a ,a,a,a,a,b,b,a b,b,a,a,a,a,b,b ,a,b,b,b,b)
+ )
>
> setkey(detailData,key1,key2,key3)
> JoinResult < - detailData [daysType]
问题与问题不同
加入两个data.tables失败
allow.cartesian
解决了问题。
这里有什么问题?为什么要添加键到 Count_SM_FM
可以解决它?
谢谢!
/ strong> Arun在v1.9.5中修复:
allow.cartesian
现在在i
没有重复项时被忽略, #742 和#508 。感谢@nigmastar,@ user3645882和其他报告。
上一个答案...
首先让我们来处理 allow.cartesian
部分。错误消息应该可以改变,指出你可以得到大尺寸,即使你没有在 i
中的重复,但你有左侧的重复 data.table
。这里有一个简单的例子:
dt1 = data.table(a = c(1,1),b = key ='a')
dt2 = data.table(a = c(1,2),c = 3:4)
dt1 [dt2]#这给出一个错误,因为连接结果为3行,如下所示
dt1 [dt2,allow.cartesian = TRUE]
#abc
#1:1 1 3
#2: 1 2 3
#3:2 NA 4
现在, - 不需要为 i
设置键,它将假定前几列是键。查看您的第一个加入结果,可以看到不已加入 ItemType
,并且您使用的是旧的 data.table
version(我使用的是1.9.3)。所以我的猜测是,或者你没有真正设置的键正确,并没有包括 ItemType
或者有一些错误在旧版本,自那以后被修复。 p>
I am new to R and to data.table
, which I find useful and fast. I am trying to join 2 data tables:
> TotFreq
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType
1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM
2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM
3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM
4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM
160167: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM
160168: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM
160169: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM
160170: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM
160171: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM
> Count_SM_FM
Legacy_Store_Number WeekDay ItemType ObjItems
1: 113802 1 SM 12305
2: 113802 1 FM 1942
3: 113802 2 SM 11014
4: 113802 2 FM 1398
5: 113802 3 SM 10154
6: 113802 3 FM 1117
7: 113802 4 SM 10414
8: 113802 4 FM 1167
9: 113802 5 SM 10258
10: 113802 5 FM 1200
11: 113802 6 SM 11116
12: 113802 6 FM 1575
13: 113802 7 SM 13098
14: 113802 7 FM 2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
>
> ResultJoin <- TotFreq[Count_SM_FM]
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
But I do not have duplicate keys in i
!
Using:
> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]
>
> ResultJoin
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems
1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305
2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305
3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305
4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305
---
320338: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326
320339: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326
320340: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326
320341: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326
320342: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326
I get in fact double the records that I had in my original TotFreq
table. If I add a key also on Count_SM_FM
the join works:
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin <- TotFreq[Count_SM_FM]
>
> ResultJoin
Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems
1: 113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942
2: 113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942
3: 113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942
4: 113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942
5: 113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942
---
160167: 113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098
160168: 113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098
160169: 113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098
160170: 113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098
160171: 113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098
I tried to verify with an example, maybe the problem was in not having the key variables as the first columns in TotFreq
or in not having Count_SM_FM
not sorted but I couldn't reproduce the error
> daysType <- data.table(
+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+ key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),
+ var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+ )
>
>
> detailData <- data.table(
+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
+ var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+ var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+ key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")
+ )
>
> setkey(detailData,key1,key2,key3)
> JoinResult <- detailData[daysType]
The problem is different than that in question
Join of two data.tables fails
because there the allow.cartesian
solves the issue.
What is the problem here? Why adding the key to Count_SM_FM
solves it?
Thanks!
Update Oct 2014: Arun fixed it in v1.9.5 :
allow.cartesian
is now ignored wheni
has no duplicates, #742 and #508. Thanks to @nigmastar, @user3645882 and others for the reports.
Previous answer ...
First let's address the allow.cartesian
part. The error message should probably be changed to point out that you can get large sizes even if you don't have duplicates in i
, but you have duplicates in the left hand side data.table
. Here's a simple example:
dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')
dt2 = data.table(a = c(1,2), c = 3:4)
dt1[dt2] # this gives an error, because join results in 3 rows, as seen below
dt1[dt2, allow.cartesian = TRUE]
# a b c
#1: 1 1 3
#2: 1 2 3
#3: 2 NA 4
Now as far as setting the key goes - no you don't need to set the key for i
, it will just assume the first few columns are the keys. Looking at your first join result one can see that it was not joined on ItemType
and that you're using an older data.table
version (I'm using 1.9.3). So my guess is that either you didn't actually set the key correctly and didn't include ItemType
or there was some bug in older versions that's been fixed since then.
这篇关于data.table join(在vecseq中的错误)是关键的必要在X和i?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!