Parallelization over for loop analyzing a data.frame


Problem Description

These days I've been working with a data.frame of 8M records, and I need to improve a loop that analyzes this data.

I will describe each step of the problem I am trying to solve. First, I have to sort the whole data.frame in ascending order by three fields: ClientID, Date and Time. (Check.) Then, using that sorted data.frame, I must compute the difference between each pair of consecutive observations, which should only be done when the ClientID is the same. For example:

ClientID|Date(YMD)|Time(HMS)
A|20120101|110000
A|20120101|111500
A|20120101|120000
B|20120202|010000
B|20120202|012030

According to the data above, the result that I want to obtain is the following:

ClientID|Date(YMD)|Time(HMS)|Difference(minutes)
A|20120101|110000|0.00
A|20120101|111500|15.00
A|20120101|120000|45.00
B|20120202|010000|0
B|20120202|012030|20.30
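The per-client difference described above can be sketched in base R on the five example rows. The column names and the decimal-minutes representation (20 min 30 s shown as 20.5 rather than 20.30) are assumptions made for illustration:

```r
## Base-R sketch on the example rows above; 20 min 30 s is shown as 20.5 minutes.
df <- data.frame(ClientID = c("A", "A", "A", "B", "B"),
                 Date = c("20120101", "20120101", "20120101", "20120202", "20120202"),
                 Time = c("110000", "111500", "120000", "010000", "012030"),
                 stringsAsFactors = FALSE)

## Parse Date + Time into a single timestamp, then sort by ClientID and timestamp
df$ts <- as.POSIXct(paste(df$Date, df$Time), format = "%Y%m%d %H%M%S", tz = "UTC")
df <- df[order(df$ClientID, df$ts), ]

## Lagged timestamp; a difference only counts when the previous row has the same ClientID
prev <- df$ts[c(NA, seq_len(nrow(df) - 1))]
same <- c(FALSE, df$ClientID[-1] == df$ClientID[-nrow(df)])
df$Difference <- ifelse(same, as.numeric(df$ts - prev, units = "mins"), 0)
```

Looping over rows like this (or with an explicit `for`) is exactly what becomes slow at 8M rows, which motivates the question below.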

The problem now is that analyzing all of this with a data.frame of 8M observations takes about 3 days. I wish I could parallelize this process. My idea is that the data.frame could be segmented into clusters, but the segmentation would have to preserve the order rather than be random; then, using the foreach library or another one, each cluster's analysis could be assigned to one of the available cores. For example:

Cluster|ClientID|Date(YMD)|Time(HMS)
CORE 1|
1|A|20120101|110000
1|A|20120101|111500
1|A|20120101|120000
CORE 2|
2|B|20120202|010000
2|B|20120202|012030
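The ordered chunking idea above can be sketched with foreach/doParallel. Splitting on ClientID (rather than on raw row ranges) guarantees that no client's rows are split across two workers; `analyze_chunk` is a hypothetical placeholder for the real per-chunk computation:

```r
## Sketch of the ordered, per-client chunking idea, assuming `df` is already
## sorted by ClientID, Date and Time. analyze_chunk() is a hypothetical
## stand-in for the actual analysis.
library(foreach)
library(doParallel)

df <- data.frame(ClientID = c("A", "A", "A", "B", "B"),
                 Time = c(110000, 111500, 120000, 10000, 12030))

analyze_chunk <- function(d) {
  ## placeholder: count rows per client in this chunk
  aggregate(Time ~ ClientID, data = d, FUN = length)
}

n_workers <- 2
cl <- makeCluster(n_workers)
registerDoParallel(cl)

## Assign whole clients to chunks so no client spans two workers
ids    <- unique(df$ClientID)
chunks <- split(ids, cut(seq_along(ids), breaks = n_workers, labels = FALSE))

result <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  analyze_chunk(df[df$ClientID %in% chunk, ])
}

stopCluster(cl)
```

As the accepted answer below argues, though, for this particular computation a vectorized grouped operation removes the need for parallelism entirely.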


Recommended Answer

I wouldn't recommend trying to parallelize this. Using the data.table package and working with times stored in an integer format, this should take a pretty trivial amount of time.

library(data.table)

## Generate Data
RowCount <- 8e6
GroupCount <- 1e4

DT <- data.table(ClientID = paste0("Client ",sample.int(GroupCount,size = RowCount, replace = TRUE)),
                 Time = sample.int(12,size = RowCount, replace = TRUE)*900)

DT[, Time := cumsum(Time), keyby = .(ClientID)]
DT[, Time := as.POSIXct(Time, tz = "UTC", origin = "1970-01-01 00:00:00")]

print(DT)

Giving:

            ClientID                Time
      1:    Client 1 1970-01-01 02:30:00
      2:    Client 1 1970-01-01 04:00:00
      3:    Client 1 1970-01-01 05:30:00
      4:    Client 1 1970-01-01 07:00:00
      5:    Client 1 1970-01-01 10:00:00
     ---                                
7999996: Client 9999 1970-02-20 18:15:00
7999997: Client 9999 1970-02-20 18:30:00
7999998: Client 9999 1970-02-20 21:00:00
7999999: Client 9999 1970-02-20 22:45:00
8000000: Client 9999 1970-02-21 00:30:00



Calculate time differences

system.time({
  ## Create an integer column that stores time as the number of seconds since 1970-01-01 00:00:00 UTC
  DT[,Time_Unix := as.integer(Time)]

  ## Order by ClientID then Time_Unix
  setkey(DT, ClientID, Time_Unix)

  ## Calculate Elapsed Time in minutes between rows, grouped by ClientID
  DT[, Elapsed_Minutes := (Time_Unix - shift(Time_Unix, n = 1L, type = "lag", fill = NA))/60L, keyby = .(ClientID)]

  ## Clean up the integer time
  DT[,Time_Unix := NULL]
})

...

   user  system elapsed 
  0.416   0.025   0.442 



Results:

print(DT)

...

            ClientID                Time Elapsed_Minutes
      1:    Client 1 1970-01-01 02:30:00              NA
      2:    Client 1 1970-01-01 04:00:00              90
      3:    Client 1 1970-01-01 05:30:00              90
      4:    Client 1 1970-01-01 07:00:00              90
      5:    Client 1 1970-01-01 10:00:00             180
     ---                                                
7999996: Client 9999 1970-02-20 18:15:00             135
7999997: Client 9999 1970-02-20 18:30:00              15
7999998: Client 9999 1970-02-20 21:00:00             150
7999999: Client 9999 1970-02-20 22:45:00             105
8000000: Client 9999 1970-02-21 00:30:00             105
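Note that the result above shows NA, not the asker's desired 0, on each client's first row. If the 0 is wanted, one small variant of the grouped step (sketched here on a tiny stand-alone table for illustration) is to build the column with `diff()`, so each group starts at 0:

```r
## Variant giving 0 instead of NA on each client's first row,
## shown on a small stand-alone table for illustration.
library(data.table)

DT <- data.table(ClientID  = c("A", "A", "B"),
                 Time_Unix = c(39600L, 40500L, 3600L))  # seconds, integer

setkey(DT, ClientID, Time_Unix)

## c(0, diff(x)) has the same length as x and starts every group at 0
DT[, Elapsed_Minutes := c(0L, diff(Time_Unix)) / 60, keyby = .(ClientID)]
```

For single-row groups `diff()` returns a zero-length vector, so `c(0L, diff(x))` still has the right length and the group's only row gets 0.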

