Strange issue with data.table row search


Problem description


I am a big fan and heavy user of data.table in R. I use it in a lot of my code, but I recently ran into a strange bug.

I have a huge data.table with multiple columns, for example:

   x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c

If I select

dataDT[x=='1']

I end up getting

   x y
1: 1 a

whereas

dataDT[(x=='1')]

gives me

   x y
1: 1 a
2: 1 b
3: 1 c

Any ideas? x and y are factors, and the data.table is keyed on x with setkey.

Additional information and code:

I actually fixed this issue, but in a way that is neither clear nor intuitive.
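The small example above can be rebuilt as a rough sketch for anyone who wants to try both forms. The column values are taken from the printed table; the variable names `fast` and `plain` are made up for illustration. As far as I know, wrapping the condition in parentheses keeps data.table from applying its key/index optimization, so the second form is an ordinary logical-vector subset.

```r
library(data.table)

# Rebuild the 9-row example: both columns are factors, table keyed on x.
dataDT <- data.table(
  x = factor(rep(1:3, each = 3)),
  y = factor(rep(c("a", "b", "c"), times = 3))
)
setkey(dataDT, x)

# Eligible for data.table's optimized (key / auto-index) subset path.
fast <- dataDT[x == "1"]

# Parentheses make `i` a plain logical vector, evaluated the standard R way.
plain <- dataDT[(x == "1")]

print(fast)
print(plain)
```

If the two printed results differ in row count, that would point at the optimized path rather than at the data.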

My code is structured as follows: I have a function called from my main code where I have to introduce a column in the data.table.

I previously used the following notation to add the column:

dataT[,nC:=oC,]

I have since found that creating the new column with

dataT$nC <- dataT$oC

fixes the bug completely.

I tried to replicate the exact same bug in a simpler example but could not, possibly because it depends on the size and structure of my data.table as well as on the specific functions I run on it.

With that said, I have a working example that shows that when you insert a column using the dataT[,nC:=oC,] notation, it acts as if the table were passed by reference to the function rather than by value.
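The by-reference behaviour described here can be seen in a much smaller sketch. The function names `add_col_base`, `add_col_by_ref` and `add_col_on_copy` are hypothetical and only serve the illustration; `copy()` is data.table's own function for taking an explicit copy.

```r
library(data.table)

DT <- data.table(x = c(1, 2, 3), y = c(10, 20, 30))

# Base-R style assignment: R copies on modify, so only the local `dt` changes.
add_col_base <- function(dt) {
  dt$z <- dt$y
  dt
}

# `:=` updates in place, so the caller's table gains the column as well.
add_col_by_ref <- function(dt) {
  dt[, z := y]
  invisible(dt)
}

# Take an explicit copy first if the caller's table must stay untouched.
add_col_on_copy <- function(dt) {
  dt <- copy(dt)
  dt[, z := y]
  dt
}

res_base <- add_col_base(DT)
has_z_after_base <- "z" %in% names(DT)

res_copy <- add_col_on_copy(DT)
has_z_after_copy <- "z" %in% names(DT)

add_col_by_ref(DT)
has_z_after_ref <- "z" %in% names(DT)

print(c(base = has_z_after_base, copy = has_z_after_copy, by_ref = has_z_after_ref))
```

Only the `:=` call without `copy()` changes the table held by the caller, which matches the observation that the table behaves as if passed by reference.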

Also, interestingly enough, in this example

dataDT[x=='1']

and

dataDT[(x=='1')]

return the same result, but the latter is 10 times slower, which I had noticed before. I hope this code can shed some light.

rm(list=ls())
library(data.table)


superParF <- function(dtInput){

  dtInputP <- dtInput[a==1]
  dtInputN <- dtInput[a==2]

  outDT    <- rbind(dtInputP[,sum(y),by='x'],
                    dtInputN[,sum(y),by='x'])
  return(outDT)
}

superFunction <- function(dtInput){

  #create new column
  dtInput[,z:=y,]

  #run function
  outDT <- rbindlist(lapply(unique(inputDT$x),
                        function(i)
                          superParF(inputDT[x==i])))
  #output outDT
  return(outDT)
}




inputDT <- data.table(x = c(rep(1,100000),
                        rep(2,100000),
                        rep(3,100000),
                        rep(4,100000),
                        rep(5,100000)),
                  y= c(rep(1:100000,5)))

inputDT$x <-  as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)

inputDT   <- rbind(inputDT,inputDT)
inputDT$a <- c(rep(1,500000),rep(2,500000))

setkey(inputDT,x)

#first observation-> the two searches do not work with the same performance

a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])

print(a)
print(b)

out <- superFunction(inputDT)

a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])

print(a)
print(b)

inputDT

Solution

In the comments I asked for the version number and for the guidelines on the Support page to be followed. That page says:

Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.

So, searching the README.md for the string "index" with Ctrl-F in the browser yields:

21 Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks @mplatzer.

26 Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.

Those are fixed in v1.9.7, which is easily installed with one command detailed on the Installation page.

The first one (item 21) looks suspiciously close to your issue. So please do try v1.9.7 as requested on the Support page in point 4.

We ask you to state the version number up front to save time, because we want to ensure you are using at least v1.9.6 on CRAN and not v1.9.4, which had this problem:

DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).

So which version are you running, please, and have you tried v1.9.7? It looks like it's worth a try.
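As a rough sketch of the two checks suggested above: report the installed version, then rerun the subset with automatic indexing switched off through the option named in the release note. The table `DT` and the names `old_opts` and `res_no_index` are made up for illustration.

```r
library(data.table)

# Report the installed version, as the answer asks for up front.
print(packageVersion("data.table"))

DT <- data.table(
  x = factor(rep(1:3, each = 3)),
  y = rep(c("a", "b", "c"), times = 3)
)

# Temporarily switch off automatic indexing (option taken from the note above).
old_opts <- options(datatable.auto.index = FALSE)
res_no_index <- DT[x == "1"]
options(old_opts)  # restore the previous setting

print(res_no_index)
```

If the subset is correct with the option off but wrong with it on, that would suggest the auto-indexing fixes listed above are the relevant ones.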
