rowwise operation with dplyr
Problem description
I am working on a large dataframe in R of 2.3 million records that contain transactions of users at locations with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user, per location. Let's call this hourly connected.
Transactions can range from 8 minutes to 48 hours, so the target dataframe will contain around 100 million records and will grow each month.
The code below shows how the final dataframe is developed, although the total code is much more complex. Running the full code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128 GB RAM.
library(dplyr)
numsessions <- 1000000
startdate <- as.POSIXlt(runif(numsessions, 1, 365*60*60)*24, origin = "2015-1-1")
df.Sessions <- data.frame(userID = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr = format(startdate, "%w"),
                          start = startdate,
                          end = startdate + runif(1, 1, 60*60*10))
dfhourly.connected <- df.Sessions %>% rowwise %>%
  do(data.frame(userID = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60*60)),
                hournr = format(seq(.$start, .$end, by = 60*60), "%H")))
We want to parallelize this procedure over (some of) the 16 cores to speed it up. A first attempt was to use the multidplyr package. The partition is made based on daynr:
df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise %>%
  do(data.frame(userID = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60*60)),
                hournr = format(seq(.$start, .$end, by = 60*60), "%H"))) %>%
  collect()
Now, the rowwise function appears to require a dataframe as input instead of a partition.
My questions are:
Is there a workaround to perform a rowwise calculation on partitions per core?
Has anyone got a suggestion to perform this calculation with a different R package and methods?
(I think posting this as an answer could benefit future readers who are interested in efficient coding.)
R is a vectorized language, so operations by row are among the most costly operations; especially if you are evaluating lots of functions, dispatching methods, converting classes and creating new data sets while you're at it.
Hence, the first step is to reduce the "by" operations. Looking at your code, it seems that you are enlarging the size of your data set according to userID, start and end - all the rest of the operations can come afterwards (and hence be vectorized). Also, running seq (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling seq.POSIXt explicitly on a POSIXt class will save you the overhead of method dispatch.
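As a quick illustration of the dispatch point (a minimal sketch, not from the original answer): calling seq.POSIXt directly gives the same result as the generic seq on POSIXct inputs, while skipping one layer of S3 method dispatch:

```r
library(microbenchmark)

x <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
y <- x + 10 * 3600  # ten hours later

# Both produce the identical hourly sequence; the direct call skips dispatch
identical(seq(x, y, by = 3600), seq.POSIXt(x, y, by = 3600))  # TRUE

microbenchmark(generic = seq(x, y, by = 3600),
               direct  = seq.POSIXt(x, y, by = 3600))
```

The saving per call is tiny, but it is paid once per group, so it adds up over millions of groups.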
I'm not sure how to do this efficiently with dplyr, because mutate can't handle it and the do function (IIRC) always proved itself to be highly inefficient. Hence, let's try the data.table package, which can handle this task easily:
library(data.table)
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]
Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.
Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.
Though, vectorizing isn't the end of the story. We also need to take class conversions, method dispatch, etc. into consideration. For instance, we could create both the hourlydate and hournr columns using different Date class functions, using format, or maybe even substr. The trade-off that needs to be taken into account is that, for instance, substr will be the fastest, but the result will be a character vector rather than a Date one - it's up to you to decide whether you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr variable:
library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")
microbenchmark("format" = format(test, "%H"),
"substr" = substr(test, 12L, 13L),
"data.table::hour" = hour(test))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88 100 b
# substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27 100 c
# data.table::hour 5.121 7.681 23.9746 27.84 33.44 55.36 100 a
data.table::hour is the clear winner on both speed and quality (the result is an integer vector rather than a character one), while improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).
Now let's try 3 different ways for the date conversion needed for the hourlydate variable:
microbenchmark("as.Date" = as.Date(test),
"substr" = substr(test, 1L, 10L),
"data.table::as.IDate" = as.IDate(test))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# as.Date 19.56285 20.09563 23.77035 20.63049 21.16888 50.04565 100 a
# substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895 100 b
# data.table::as.IDate 19.91964 20.44250 27.50989 21.34551 31.79939 145.65133 100 a
It seems the first and third options are pretty much the same speed-wise, while I prefer as.IDate because of its integer storage mode.
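The storage-mode difference is easy to verify (a small sketch to illustrate the point): base Date stores the day count as a double, while IDate stores the same value as an integer, which is more compact and groups faster in data.table:

```r
library(data.table)

typeof(as.Date("2015-06-01"))   # "double"  - base Date stores days as numeric
typeof(as.IDate("2015-06-01"))  # "integer" - IDate stores the same days as integer
class(as.IDate("2015-06-01"))   # c("IDate", "Date") - Date methods still apply
```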
Now that we know where both efficiency and quality lie, we can simply finish the task by running
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
(You can then easily remove the unnecessary columns using a similar syntax, res[, yourcolname := NULL], which I'll leave to you.)
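Putting the pieces together, the whole replacement for the rowwise pipeline can be sketched end to end as follows (a small self-contained version with toy data in the spirit of the question's generator; V1 is the default name data.table gives the hourly timestamps produced by seq.POSIXt):

```r
library(data.table)

# Toy version of the question's sessions data (small N so it runs quickly)
set.seed(123)
N <- 1000
startdate <- as.POSIXct(runif(N, 1, 365 * 60 * 60) * 24, origin = "2015-01-01")
df.Sessions <- data.frame(userID = round(runif(N, 1, 500)),
                          start  = startdate,
                          end    = startdate + runif(N, 1, 60 * 60 * 10))

# One grouped call expands each session into hourly timestamps (column V1)
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600),
                          by = .(userID, start, end)]

# Vectorized class conversions, then drop the helper columns
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
res[, c("V1", "start", "end") := NULL]
```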
There are probably more efficient ways of solving this, but this demonstrates one possible way of making your code more efficient.
As a side note, if you want to investigate data.table syntax/features further, here's a good read:
https://github.com/Rdatatable/data.table/wiki/Getting-started