Subsetting using data.table instead of data.frame

Problem description


I am working with a data frame that has 3 million rows and 10 columns, and subsetting it takes a long time. If I convert it to a data.table and subset that instead, will it be faster? Here is some toy code:

s<-c(100,100,100,800,800,6662,33565,265653262,266532)
p<-c(5,5,5,10,10,10,8,9,10)
name<-c("bob","bob","bob","ed","ed","ed","joe","frank","ted")
time<- as.POSIXct(as.character(c("2014-10-27 18:11:36 PDT","2014-10-27 18:11:37 PDT","2014-10-27 18:11:38 PDT","2014-10-27 18:11:39 PDT","2014-10-27 18:11:40 PDT","2014-10-27 18:11:41 PDT","2014-10-27 19:11:36 PDT","2014-10-27 20:11:36 PDT","2014-10-27 21:11:36 PDT")))
dat<- data.frame(s,p,name,time)
dat

Here is the data frame:

          s  p  name                time
1       100  5   bob 2014-10-27 18:11:36
2       100  5   bob 2014-10-27 18:11:37
3       100  5   bob 2014-10-27 18:11:38
4       800 10    ed 2014-10-27 18:11:39
5       800 10    ed 2014-10-27 18:11:40
6      6662 10    ed 2014-10-27 18:11:41
7     33565  8   joe 2014-10-27 19:11:36
8 265653262  9 frank 2014-10-27 20:11:36
9    266532 10   ted 2014-10-27 21:11:36

Now I subset the data frame:

  result <- subset(dat,    as.numeric(s) == 100
                   &  p == 5
                   &  name  == "bob"
                   & time >= "2014-10-27 18:11:36 PDT"
                   & time <= "2014-10-27 18:12:00 PDT"
                   )
  result

    s p name                time
1 100 5  bob 2014-10-27 18:11:36
2 100 5  bob 2014-10-27 18:11:37
3 100 5  bob 2014-10-27 18:11:38

How can I do something similar using data.table?

Thank you.

Solution

Well, your example code is fragile for data frames because of the "time" selectors: you're comparing date-times (the POSIXct column in the data frame) with character strings (in the selector), so the comparison depends on implicit conversion. I think you want to convert the bounds explicitly:

result <- subset(dat,    as.numeric(s) == 100
               &  p == 5
               &  name  == "bob"
               & time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
               & time <= as.POSIXlt("2014-10-27 18:12:00 PDT")
               )

result
    s p name                time
1 100 5  bob 2014-10-27 18:11:36
2 100 5  bob 2014-10-27 18:11:37
3 100 5  bob 2014-10-27 18:11:38
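
To make the explicit-conversion idea concrete, here is a minimal base R sketch. The `tz = "UTC"` setting and the names `lower`, `upper` and `in_window` are my own assumptions for the sake of a reproducible example; they are not part of the question. As far as I can tell, the trailing "PDT" in the original strings is not parsed as a time zone, so the zone actually used comes from the `tz` argument or the session default.

```r
# Three of the toy timestamps; tz = "UTC" is an assumption for reproducibility
time <- as.POSIXct(c("2014-10-27 18:11:36",
                     "2014-10-27 18:11:41",
                     "2014-10-27 19:11:36"), tz = "UTC")
class(time)  # "POSIXct" "POSIXt"

# Convert the window bounds once, explicitly, to the same class and time zone
lower <- as.POSIXct("2014-10-27 18:11:36", tz = "UTC")
upper <- as.POSIXct("2014-10-27 18:12:00", tz = "UTC")

# Logical mask: TRUE where the timestamp falls inside the window
in_window <- time >= lower & time <= upper
in_window  # TRUE TRUE FALSE
```

Converting the bounds once up front also avoids re-parsing the strings every time the filter is evaluated.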

This syntax works perfectly well for data.tables:

library(data.table)
dat <- as.data.table(dat)
result <- subset(dat,
              as.numeric(s) == 100
              &  p == 5
              &  name  == "bob"
              & time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
              & time <= as.POSIXlt("2014-10-27 18:12:00 PDT")
)
result

     s p name                time
1: 100 5  bob 2014-10-27 18:11:36
2: 100 5  bob 2014-10-27 18:11:37
3: 100 5  bob 2014-10-27 18:11:38

If you want something more data.table-like, you can avoid "subset" entirely and instead just operate on the data.table directly:

dat <- as.data.table(dat)
result <- dat[as.numeric(s) == 100
              & p == 5
              & name  == "bob"
              & time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
              & time <= as.POSIXlt("2014-10-27 18:12:00 PDT"),]

result 

     s p name                time
1: 100 5  bob 2014-10-27 18:11:36
2: 100 5  bob 2014-10-27 18:11:37
3: 100 5  bob 2014-10-27 18:11:38
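
On the speed question: a plain logical filter like the one above still has to evaluate every condition on every row. One thing that may help on a 3-million-row table is to set a key on the equality columns, so that data.table can find the matching rows by binary search, and then apply the time window to that smaller result. The following is only a sketch on the toy data, assuming `tz = "UTC"` and using my own names (`dt`, `lower`, `upper`); whether it is faster on the real data is worth measuring, for example with `system.time()`.

```r
library(data.table)

# Toy data from the question; tz = "UTC" is an assumption for reproducibility
dt <- data.table(
  s    = c(100, 100, 100, 800, 800, 6662, 33565, 265653262, 266532),
  p    = c(5, 5, 5, 10, 10, 10, 8, 9, 10),
  name = c("bob", "bob", "bob", "ed", "ed", "ed", "joe", "frank", "ted"),
  time = as.POSIXct(c("2014-10-27 18:11:36", "2014-10-27 18:11:37",
                      "2014-10-27 18:11:38", "2014-10-27 18:11:39",
                      "2014-10-27 18:11:40", "2014-10-27 18:11:41",
                      "2014-10-27 19:11:36", "2014-10-27 20:11:36",
                      "2014-10-27 21:11:36"), tz = "UTC")
)

# Window bounds, converted once to the same class and time zone as the column
lower <- as.POSIXct("2014-10-27 18:11:36", tz = "UTC")
upper <- as.POSIXct("2014-10-27 18:12:00", tz = "UTC")

# Sort the table by the equality columns and mark them as the key
setkey(dt, s, p, name)

# Keyed lookup on (s, p, name), then a range filter on time
result <- dt[.(100, 5, "bob")][time >= lower & time <= upper]
result
```

Setting the key sorts the table, which has a one-off cost, so this tends to pay off mainly when the same table is subset many times.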
