Rowwise operation with dplyr

This article describes how to perform rowwise operations with dplyr; it should be a useful reference for anyone facing the same problem.

Problem description


I am working on a large dataframe in R of 2.3 million records that contains transactions of users at locations, with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user per location. Let's call this "hourly connected".

Transactions can range from 8 minutes to 48 hours, so the target dataframe will be around 100 million records and will grow each month.

The code underneath shows how the final dataframe is developed, although the total code is much more complex. Running the full code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128GB RAM.

library(dplyr)

numsessions<-1000000
startdate <-as.POSIXlt(runif(numsessions,1,365*60*60)*24,origin="2015-1-1")

df.Sessions<-data.frame(userID = round(runif(numsessions,1,500)),
           postalcode = round(runif(numsessions,1,100)),
           daynr = format(startdate,"%w"),
              start =startdate ,
              end=   startdate + runif(1,1,60*60*10)
           )


dfhourly.connected <-df.Sessions %>% rowwise %>% do(data.frame(userID=.$userID,
                                          hourlydate=as.Date(seq(.$start,.$end,by=60*60)),
                                          hournr=format(seq(.$start,.$end,by=60*60),"%H")
                                          )
                               )

We want to parallelize this procedure over (some of) the 16 cores to speed things up. A first attempt was to use the multidplyr package. The partition is made based on daynr.

df.hourlyconnected<-df.Sessions %>% 
                      partition(daynr,cluster=init_cluster(6)) %>%
                      rowwise %>% do(data.frame(userID=.$userID,
                            hourlydate=as.Date(seq(.$start,.$end,by=60*60)),
                            hournr=format(seq(.$start,.$end,by=60*60),"%H")
                              )
                            ) %>% collect()

Now, the rowwise function appears to require a dataframe as input instead of a partition.

My questions are:

  • Is there a workaround to perform a rowwise calculation on partitions per core?

  • Does anyone have a suggestion for performing this calculation with a different R package and approach?

Solution

(I think posting this as an answer could benefit future readers who are interested in efficient coding.)


R is a vectorized language, thus operations by row are among the most costly operations; especially if you are evaluating lots of functions, dispatching methods, converting classes and creating new data sets while you are at it.
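To see that cost in isolation, here is a toy benchmark (my own illustration, not part of the original answer) comparing the same addition done row by row and done over whole columns with dplyr:

library(dplyr)
library(microbenchmark)

# Toy data: 10,000 rows with two numeric columns
d <- data.frame(x = runif(1e4), y = runif(1e4))

microbenchmark(
  "rowwise"    = d %>% rowwise() %>% mutate(z = x + y),  # evaluated once per row
  "vectorized" = d %>% mutate(z = x + y),                # evaluated once per column
  times = 10
)

The rowwise version pays per-row grouping and dispatch overhead on every record, which is exactly what blows up at 2.3 million rows.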

Hence, the first step is to reduce the "by" operations. By looking at your code, it seems that you are enlarging the size of your data set according to userID, start and end - all the rest of the operations could come afterwards (and hence be vectorized). Also, running seq (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling seq.POSIXt explicitly on a POSIXt class will save you the overhead of method dispatch.
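As a quick illustration of that last point (a minimal sketch with made-up timestamps, not from the original post): seq() on a date-time object dispatches to seq.POSIXt() anyway, so calling the method directly returns the same result while skipping the repeated S3 lookup

s <- as.POSIXct("2015-06-01 08:00:00")
e <- as.POSIXct("2015-06-01 14:00:00")

identical(seq(s, e, by = 3600), seq.POSIXt(s, e, by = 3600))  # TRUE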

I'm not sure how to do this efficiently with dplyr, because mutate can't handle it and the do function (IIRC) has always proved itself to be highly inefficient. Hence, let's try the data.table package, which can handle this task easily

library(data.table) 
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)] 

Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch
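(If you prefer a named column over the default V1 that data.table assigns to an unnamed expression, a small variation of the same call works - the column name connected below is my own choice, not from the original answer; the rest of this answer keeps the default V1, so the code further down is unchanged.)

library(data.table)                               # already attached above
# Wrap the expression in .() to name the generated column explicitly
res <- setDT(df.Sessions)[, .(connected = seq.POSIXt(start, end, by = 3600)),
                          by = .(userID, start, end)]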


Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.

Though, vectorizing isn't the end of the story. We also need to take into consideration class conversions, method dispatch, etc. For instance, we can create both hourlydate and hournr using either different Date class functions, or using format, or maybe even substr. The trade-off that needs to be taken into account is that, for instance, substr will be the fastest, but the result will be a character vector rather than a Date one - it's up to you to decide whether you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr variable

library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")

microbenchmark("format" = format(test, "%H"),
               "substr" = substr(test, 12L, 13L),
               "data.table::hour" = hour(test))

# Unit: microseconds
#             expr        min         lq        mean    median        uq       max neval cld
#           format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88   100  b 
#           substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27   100   c
# data.table::hour      5.121      7.681     23.9746     27.84     33.44     55.36   100 a  

data.table::hour is the clear winner on both speed and quality (the result is an integer vector rather than a character one), while improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).

Now let's try 3 different vectorized ways of calculating the hourlydate variable

microbenchmark("as.Date" = as.Date(test), 
               "substr" = substr(test, 1L, 10L),
               "data.table::as.IDate" = as.IDate(test))

# Unit: milliseconds
#                 expr       min        lq      mean    median        uq       max neval cld
#              as.Date  19.56285  20.09563  23.77035  20.63049  21.16888  50.04565   100  a 
#               substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895   100   b
# data.table::as.IDate  19.91964  20.44250  27.50989  21.34551  31.79939 145.65133   100  a 

It seems like the first and third options are pretty much the same speed-wise, but I prefer as.IDate because of its integer storage mode.
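The storage difference is easy to verify (a quick check of my own, not in the original answer):

d <- as.Date("2015-06-01")
typeof(d)            # "double"  - base Date stores the day offset as numeric
typeof(as.IDate(d))  # "integer" - IDate stores it as an integer
class(as.IDate(d))   # "IDate" "Date" - still usable wherever a Date is expected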


Now that we know where both efficiency and quality lie, we can simply finish the task by running

res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]

(You can then easily remove the unnecessary columns with similar syntax, res[, yourcolname := NULL], which I'll leave to you.)
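Putting the pieces together, a complete sketch would look like the following (assuming the simulated df.Sessions from the question; the column drop at the end is optional):

library(data.table)

# Expand each session into one row per connected hour
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600),
                          by = .(userID, start, end)]

# Derive the date and hour from the generated timestamps (default column V1)
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]

# Drop the helper columns that are no longer needed
res[, c("V1", "start", "end") := NULL]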


There are probably more efficient ways of solving this, but this demonstrates one possible way of making your code more efficient.

As a side note, if you want to investigate data.table syntax/features further, here's a good read:

https://github.com/Rdatatable/data.table/wiki/Getting-started

That concludes this article on rowwise operations with dplyr. We hope the answer above is helpful.
