Using data.table subsetting for non-equality conditions


Problem description

I have a data table with 400k rows, and subsetting it is very slow.

Here is a sample data frame:

                 date   name value size car1 car2
1 2015-01-01 07:44:00    bob     1    5    A    D
2 2015-02-02 09:46:00 george   522    2    B    F

Now I subset it the slow way using subset():

main <- data.frame(
  date  = as.POSIXct(c("2015-01-01 07:44:00", "2015-02-02 09:46:00"), tz = "GMT"),
  name  = c("bob", "george"),
  value = c(1, 522),
  size  = c(5, 2),
  car1  = c("A", "B"),
  car2  = c("D", "F")
)
main$date
subset(main,    size >1 
       &  value == 522
       &  name == "george" 
       &  date >= as.POSIXct("2015-01-01 03:44:00",tz="GMT") &  date >= as.POSIXct("2015-01-01 08:44:00",tz="GMT")
       &  (car1 == "F" | car2 == "F")
)

                 date   name value size car1 car2
2 2015-02-02 09:46:00 george   522    2    B    F

This works and returns 1 row, but it is very slow.

Based on responses to another question, data.table looks to be much faster, so I would like to use it to do the same thing as above, but I have a number of questions.

Here is what I have so far:

library(data.table)
mdt <- as.data.table(main)
setkey(mdt, date, name, value, size, car1, car2)
mdt[.(as.POSIXct("2015-01-01 03:44:00"), "george", 522, 2, "F", "F")]

This returns:

                  date   name value size car1 car2
1: 2015-01-01 03:44:00 george   522    2   NA    F

Here are my questions:

(1) I want a criterion of the form date >= X and date <= Y. Is this possible with data.table? If not, any ideas on how to make the subsetting faster?

(2) I want a criterion of the form (car1 == "F" | car2 == "F"). Is this possible? If not, any ideas on how to make the subsetting faster?

(3) The output of mdt[...] shows a date of 2015-01-01 03:44:00, but this date is NOT in the original "main" data frame. What is happening here?

(4) The output of mdt[...] shows a car1 value of NA, although car1 is not NA in the original "main" data frame. What is happening here?

Thank you.

Solution

Sure, you just put the criteria in the i expression.

setDT(main)
main[size >1 &
       value == 522 &
       name == "george" &
       date >= as.POSIXct("2015-01-01 03:44:00",tz="GMT") &
       date >= as.POSIXct("2015-01-01 08:44:00",tz="GMT") &
       (car1 == "F" | car2 == "F"), ]

Result:

                  date   name value size car1 car2
1: 2015-02-02 09:46:00 george   522    2    B    F
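For questions (1) and (2) specifically, a two-sided date range and an OR condition can sit side by side in i. The following is a minimal sketch, assuming the two-row sample from the question; the bounds `lo` and `hi` are illustrative values, not taken from the thread. data.table's `between()` helper is shown as an alternative spelling of the inclusive range.

```r
library(data.table)

# Two-row sample from the question, built directly as a data.table
# (data.table() keeps strings as character columns).
main <- data.table(
  date  = as.POSIXct(c("2015-01-01 07:44:00", "2015-02-02 09:46:00"), tz = "GMT"),
  name  = c("bob", "george"),
  value = c(1, 522),
  size  = c(5, 2),
  car1  = c("A", "B"),
  car2  = c("D", "F")
)

# Hypothetical bounds, chosen only for illustration.
lo <- as.POSIXct("2015-01-01 03:44:00", tz = "GMT")
hi <- as.POSIXct("2015-03-01 00:00:00", tz = "GMT")

# Lower bound, upper bound, and an OR condition, all in i.
res <- main[date >= lo & date <= hi & (car1 == "F" | car2 == "F")]

# between() is inclusive on both ends by default, so it selects the same rows.
res2 <- main[between(date, lo, hi) & (car1 == "F" | car2 == "F")]

print(res)
```

Both forms are plain vector scans; neither depends on the table having a key.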

So, is that faster than subset? Yup.

library(data.table)
library(ggplot2)
library(reshape2)

set.seed(1)

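cf <- function(n) {
  # Simulated data: n rows with dates spread over the next 100 days
  main <-
    data.frame(date = as.POSIXct(Sys.Date() + runif(n, 0, 100)),
               name = sample(c("bob", "george"), n, replace = TRUE),
               value = round(runif(n, 400, 600), 0),
               size = sample(1:5, n, replace = TRUE),
               car1 = sample(LETTERS[1:6], n, replace = TRUE),
               car2 = sample(LETTERS[1:6], n, replace = TRUE),
               stringsAsFactors = FALSE)
  mdt <- data.table(main)
  setkey(mdt, date, name, value, size, car1, car2)

  # Time the data.table subset; units are fixed to seconds because
  # Sys.time() - pre picks its units automatically (secs, mins, ...)
  pre <- Sys.time()
  mdt[size > 1 & value > 100 & name == "george" &
        date >= as.POSIXct(Sys.Date()) & date <= as.POSIXct(Sys.Date() + 50) &
        (car1 == "F" | car2 == "F"), ]
  dt_time <- as.numeric(difftime(Sys.time(), pre, units = "secs"))

  # Time the equivalent subset() call on the plain data frame
  pre <- Sys.time()
  subset(main,
         size > 1 & value > 100 & name == "george" &
         date >= as.POSIXct(Sys.Date()) & date <= as.POSIXct(Sys.Date() + 50) &
         (car1 == "F" | car2 == "F"))
  subset_time <- as.numeric(difftime(Sys.time(), pre, units = "secs"))

  return(c(n = n, dt_time = dt_time, subset_time = subset_time))
}

# Run the comparison for n = 10^2 up to 10^7 rows, then reshape to long format
result <- sapply(10^(2:7), cf)
result <- melt(data.frame(t(result)), id.vars = "n")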

ggplot(result, aes(x=n, y=value, color=variable)) +
  geom_point() + geom_line() + theme_bw() +
  scale_x_log10()
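Questions (3) and (4) are not addressed above. What follows is an inference from the printed output, not something the thread confirms. `mdt[.(...)]` is a keyed join, not a filter: with the default `nomatch = NA`, every lookup row is returned even when nothing in the table matches, and the key columns in the result carry the lookup values. That would explain the date 2015-01-01 03:44:00 appearing in the output. The NA in car1 is plausibly a factor effect: before R 4.0, `data.frame()` converted strings to factors by default, and "F" is a level of car2 but not of car1. The sketch below uses character columns, so it reproduces only the unmatched-row behaviour; the name `probe` is illustrative.

```r
library(data.table)

mdt <- data.table(
  date  = as.POSIXct(c("2015-01-01 07:44:00", "2015-02-02 09:46:00"), tz = "GMT"),
  name  = c("bob", "george"),
  value = c(1, 522),
  size  = c(5, 2),
  car1  = c("A", "B"),
  car2  = c("D", "F")
)
setkey(mdt, date, name, value, size, car1, car2)

# A lookup tuple that matches no row of mdt (hypothetical name `probe`).
probe <- list(as.POSIXct("2015-01-01 03:44:00", tz = "GMT"),
              "george", 522, 2, "F", "F")

# Default nomatch = NA: one row comes back, echoing the lookup values.
unmatched <- mdt[probe]

# nomatch = 0L drops lookup rows that have no match, so nothing comes back.
dropped <- mdt[probe, nomatch = 0L]

print(unmatched)
print(nrow(dropped))
```

A keyed join of this kind only expresses equality on the key columns, which is why the range and OR conditions in the answer are written as a logical expression in i instead.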
