GPU processing in R (How to use GPU processing to run a function on subsets of a dataset)


Question

I have a large dataset (around 5 million observations). Each observation records the total revenue from a specific event, broken down by different types of sub-events denoted by "Type". A small replication of the data is below:

Event_ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Type=c("A","B","C","D","E","A","B","C","D","E","A","B","C","D")
Revenue1=c(24,9,51,7,22,15,86,66,0,57,44,93,34,37)
Revenue2=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(Event_ID,Type,Revenue1,Revenue2)

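I would like to use GPU cores to run a function I have written (I have never tried GPU processing, so I am completely lost as to how to start). The actual function takes a very long time to run. A heavily simplified version of the function is shown below: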

Total_Revenue=function(data){
  full_list=list()
  event_list=unique(data[,'Event_ID'])
  for (event in event_list){
    new_data=list()
    event_data = data[which(data$Event_ID==event),]
    for (i in 1:nrow(event_data)){
      event_data[i,'Total_Rev'] = event_data[i,'Revenue1']+event_data[i,'Revenue2'] 
      new_data=rbind(new_data,event_data[i,])
    }
    full_list=rbind(full_list,new_data)
  }
  return(full_list)
}

Total = Total_Revenue(data=z)
print(Total)

This simplified version of the function proceeds as follows:

a) Break up the dataset into subsets such that each subset contains only one unique event.

b) Within each subset, loop through all the observations and compute Revenue1 + Revenue2 for each one.

c) Store the subsets and, at the end, return the new dataset.

Since I have no prior experience, I was looking at some of the R packages. I found the gpuR package and installed it. However, I am having difficulty understanding how to use it. Another issue is that my coding background is very weak; I have taught myself a few things over the past year.

Any help/leads will be highly appreciated. I am open to using any alternate packages as well. Please let me know if I missed anything.

P.S. I also took a snapshot of my system using the following command:

str(gpuInfo())

I am attaching the output for your reference:

P.P.S. Please note that my actual function is somewhat complicated and long, and it takes a long time to run, which is why I want to use GPU processing here.

Recommended answer

GPU programming is no silver bullet. It works well only for certain problems. That's why the gpuR package provides GPU-based vectors and matrices, allowing linear algebra operations to be done on the GPU. This won't help you if your problem is not a linear algebra problem. However, note that many problems can be formulated in such a way.
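
To illustrate what "formulating a problem as linear algebra" can look like, here is a minimal sketch in plain base R (no GPU involved) that expresses the row-wise sum of the two revenue columns from the question as a matrix-vector product. The names `X`, `w` and `total_rev` are made up for this illustration. Whether moving such a product to the GPU pays off depends on the data size and the transfer overhead, which this sketch does not measure.

```r
# Same toy data as in the question
Revenue1 <- c(24, 9, 51, 7, 22, 15, 86, 66, 0, 57, 44, 93, 34, 37)
Revenue2 <- c(16, 93, 96, 44, 67, 73, 12, 65, 81, 22, 39, 94, 41, 30)

# Stack the two revenue columns into a 14 x 2 matrix
X <- cbind(Revenue1, Revenue2)

# Multiplying by a vector of ones sums across the columns of each row
w <- c(1, 1)
total_rev <- as.vector(X %*% w)

# The result agrees with the plain element-wise sum
stopifnot(all.equal(total_rev, Revenue1 + Revenue2))
```

A matrix product like `X %*% w` is the kind of operation that GPU matrix classes are designed for, so a computation written this way is at least a candidate for GPU acceleration.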

We cannot tell if your problem falls into this category, since you have (probably) over-simplified your code:

> print(Total)
   Event_ID Type Revenue1 Revenue2 Total_Rev
1         1    A       24       16        40
2         1    B        9       93       102
3         1    C       51       96       147
4         1    D        7       44        51
5         1    E       22       67        89
6         2    A       15       73        88
7         2    B       86       12        98
8         2    C       66       65       131
9         2    D        0       81        81
10        2    E       57       22        79
11        3    A       44       39        83
12        3    B       93       94       187
13        3    C       34       41        75
14        3    D       37       30        67

Since Total_Rev is just the sum of Revenue1 and Revenue2, you could have done this more easily:

> z$Total_Rev <- z$Revenue1 + z$Revenue2
> z
   Event_ID Type Revenue1 Revenue2 Total_Rev
1         1    A       24       16        40
2         1    B        9       93       102
3         1    C       51       96       147
4         1    D        7       44        51
5         1    E       22       67        89
6         2    A       15       73        88
7         2    B       86       12        98
8         2    C       66       65       131
9         2    D        0       81        81
10        2    E       57       22        79
11        3    A       44       39        83
12        3    B       93       94       187
13        3    C       34       41        75
14        3    D       37       30        67
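
To get a feeling for why this matters on a large dataset, the following rough sketch times a row-by-row loop that grows a data frame with `rbind()` against the vectorized assignment. The synthetic data, the size `n` and the helper names `loop_version` and `vectorized_version` are assumptions made for this illustration, not part of the original question, and the timings will differ from machine to machine.

```r
set.seed(1)   # reproducible synthetic data
n <- 2000     # hypothetical size, far below the 5 million rows in the question
big <- data.frame(
  Event_ID = rep(seq_len(n / 5), each = 5),
  Revenue1 = sample(0:100, n, replace = TRUE),
  Revenue2 = sample(0:100, n, replace = TRUE)
)

# Row-by-row version: grows a data frame with rbind() inside a loop
loop_version <- function(data) {
  out <- data[0, ]
  out$Total_Rev <- numeric(0)
  for (i in seq_len(nrow(data))) {
    row <- data[i, ]
    row$Total_Rev <- row$Revenue1 + row$Revenue2
    out <- rbind(out, row)
  }
  out
}

# Vectorized version: one column-wise addition, no explicit loop
vectorized_version <- function(data) {
  data$Total_Rev <- data$Revenue1 + data$Revenue2
  data
}

t_loop <- system.time(res_loop <- loop_version(big))
t_vec  <- system.time(res_vec  <- vectorized_version(big))

# Both versions compute the same totals
stopifnot(all.equal(res_loop$Total_Rev, res_vec$Total_Rev))

# Elapsed wall-clock seconds for each version
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

Because every `rbind()` call copies the rows accumulated so far, the cost of the loop version grows faster than linearly with the number of rows, so the gap widens as `n` increases.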

This is a simple form of vectorization, which helps you get rid of (some) for loops. And since your outer for loop looks at different Event_IDs, it might also make sense to look into grouping and aggregation techniques. These can be done with base R, the data.table package, tidyverse/dplyr, and possibly other tools. I am using the dplyr approach, since I find its syntax the most newbie-friendly. However, data.table might be the right tool for you if you have large data sets. So here is a very simple aggregation that computes the average per Event_ID:

Event_ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Type=c("A","B","C","D","E","A","B","C","D","E","A","B","C","D")
Revenue1=c(24,9,51,7,22,15,86,66,0,57,44,93,34,37)
Revenue2=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(Event_ID,Type,Revenue1,Revenue2)

library(dplyr)
z %>%
  mutate(Total_Rev = Revenue1 + Revenue2) %>%
  group_by(Event_ID) %>%
  summarise(average = mean(Total_Rev))
#> # A tibble: 3 x 2
#>   Event_ID average
#>      <dbl>   <dbl>
#> 1        1    85.8
#> 2        2    95.4
#> 3        3   103
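
For comparison, the same grouped average can be sketched in base R without any extra packages. The second half shows the general split-apply-combine pattern for running an arbitrary function on each event's subset. The function `per_event` is a hypothetical stand-in for the real, more complicated function, which the question does not show.

```r
Event_ID <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
Type <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "A", "B", "C", "D")
Revenue1 <- c(24, 9, 51, 7, 22, 15, 86, 66, 0, 57, 44, 93, 34, 37)
Revenue2 <- c(16, 93, 96, 44, 67, 73, 12, 65, 81, 22, 39, 94, 41, 30)
z <- data.frame(Event_ID, Type, Revenue1, Revenue2)
z$Total_Rev <- z$Revenue1 + z$Revenue2

# Grouped mean with aggregate(): one row per Event_ID
avg <- aggregate(Total_Rev ~ Event_ID, data = z, FUN = mean)
print(avg)

# Split-apply-combine: run a function on each event's subset
per_event <- function(d) {   # hypothetical stand-in for the real function
  data.frame(Event_ID = d$Event_ID[1],
             n_types  = nrow(d),
             average  = mean(d$Total_Rev))
}
result <- do.call(rbind, lapply(split(z, z$Event_ID), per_event))
print(result)
```

Here `split()` produces one data frame per Event_ID, `lapply()` applies the function to each of them, and `do.call(rbind, ...)` combines the results once at the end instead of growing a data frame inside a loop.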
