Apply function to multiple groups using Rcpp and R function

Question

I'm trying to apply a function to multiple groups/IDs in R using the foreach package. It's taking forever to run with parallel processing via %dopar%, so I was wondering whether it's possible to run the apply or for-loop portion in C++ via Rcpp or another package to make it faster. I'm not familiar with C++ or with other packages that can do this, so I'm hoping to learn whether this is possible. The sample code is below. My actual function is longer, has over 20 inputs, and takes even longer to run than what I'm posting.

Thanks for your help.

I realized my initial question was vague, so I'll try to do a better job. I have a table with time series data by group. Each group has more than 10K rows. I have written a function in C++ via Rcpp that filters the table by group and applies a function. I would like to loop through the unique groups and combine the results, as rbind does, using Rcpp so that it runs faster. See the sample code below (my actual function is longer).

library(data.table)
library(inline)
library(Rcpp)
library(stringi)
library(Runuran)

# Fake data
DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
                                                   pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))

df <- DT[order(Group)][
  , .(Month = seq(1, 180, 1),
      Col1 = urnorm(180, mean = 500, sd = 1, lb = 5, ub = 1000), 
      Col2 = urnorm(180, mean = 1000, sd = 1, lb = 5, ub = 1000), 
      Col3 = urnorm(180, mean = 300, sd = 1, lb = 5, ub = 1000)), 
  by = Group
  ]

# Rcpp function (saved as sample.cpp)
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
DataFrame testFunc(DataFrame df, StringVector ids, double var1, double var2) {

  // Filter by group
  // NOTE: `ind` is built here but never applied below, so the sums cover every row of df
  using namespace std;  
  StringVector sub = df["Group"];
  std::string level = Rcpp::as<std::string>(ids[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

  // Access the columns
  CharacterVector Group = df["Group"];
  DoubleVector Month = df["Month"];
  DoubleVector Col1 = df["Col1"];
  DoubleVector Col2 = df["Col2"];
  DoubleVector Col3 = df["Col3"];


  // Create calculations
  DoubleVector Cola = Col1 * (var1 * var2);
  DoubleVector Colb = Col2 * (var1 * var2);
  DoubleVector Colc = Col3 * (var1 * var2);
  DoubleVector Cold = (Cola + Colb + Colc);

  // Result summary
  std::string Group_ID = level;
  double SumCol1 = sum(Col1);
  double SumCol2 = sum(Col2);
  double SumCol3 = sum(Col3);
  double SumColAll = sum(Cold);

  // return a new data frame
  return DataFrame::create(_["Group_ID"]= Group_ID, _["SumCol1"]= SumCol1,
                            _["SumCol2"]= SumCol2, _["SumCol3"]= SumCol3, _["SumColAll"]= SumColAll);
}

# Test function
Rcpp::sourceCpp('sample.cpp')
testFunc(df, ids = "BFTHU1315C", var1 = 24, var2 = 76) # ideally I would like to loop through all groups (unique(df$Group))

#     Group_ID  SumCol1 SumCol2  SumCol3  SumColAll
# 1 BFTHU1315C 899994.6 1798561 540001.6 5907129174

Thanks.

Recommended answer

I would suggest rethinking your approach. Your test data set, which I assume is comparable to your real data set, has 3e8 rows. I estimate that this is about 10 GB of data. You seem to do the following with this data:

  • Determine the list of unique IDs (about 5e5)
  • Create one task per unique ID
  • Each of these tasks gets the full data set and filters out all data that does not belong to the ID in question
  • Each of these tasks adds some additional columns that do not depend on the ID
  • Each of the tasks does a group_by(ID), but there is only one ID left in the data set
  • Each of the tasks calculates some simple means
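
The cost of that pattern can be seen on a small scale. The following is a minimal sketch in base R, assuming a hypothetical toy table `toy` (it is not the asker's data): the per-ID version scans the full table once for every ID, while the grouped version splits the data once for all IDs.

```r
# Hypothetical toy data: 5 groups, 1,000 rows (names are illustrative only)
set.seed(1)
toy <- data.frame(Group = sample(LETTERS[1:5], 1000, replace = TRUE),
                  Col1  = rnorm(1000))

# One task per ID: every task scans the full table to filter its own rows,
# so the total work grows with (number of groups) x (number of rows).
ids <- sort(unique(toy$Group))
per_id <- vapply(ids, function(id) sum(toy$Col1[toy$Group == id]), numeric(1))

# One grouped pass: the grouping is done once for all IDs.
grouped <- tapply(toy$Col1, toy$Group, sum)

# Both approaches should agree on the result; only the amount of work differs.
all.equal(unname(per_id), as.vector(grouped[ids]))
```

With a few groups the difference is negligible. With roughly 5e5 IDs, the per-ID version repeats the full scan 5e5 times.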

To me this appears very inefficient with respect to memory usage. Generally speaking, for problems like this you would want "shared memory parallelism", but foreach gives you only "process parallelism". The downside of process parallelism is that it increases the memory cost.

In addition, you are throwing away all the grouping and aggregation code that exists in base R / dplyr / data.table / SQL engines / ... It is very unlikely that you or anyone reading your question here would be able to improve on these existing code bases.

My suggestions:

  • Forget about "process parallelism" (for now)
  • If you have sufficient RAM, try with a simple dplyr pipe with mutate / group_by / summarize.
  • If that is not fast enough, learn how aggregation works with data.table, which is known to be faster and offers "shared memory parallelism" via OpenMP.
  • If your computer does not have enough memory and is swapping, then look into possibilities for out-of-memory computation. Personally I would use an (embedded) database.
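
As a rough illustration of the dplyr suggestion, here is a minimal sketch of such a pipe. The toy table `toy` and the values of `var1` and `var2` are assumptions that stand in for the question's inputs; the column names follow the question.

```r
library(dplyr)

# Hypothetical stand-ins for the question's inputs
var1 <- 24
var2 <- 76
set.seed(42)
toy <- data.frame(Group = rep(c("A", "B", "C"), each = 4),
                  Col1  = rnorm(12, mean = 500),
                  Col2  = rnorm(12, mean = 1000),
                  Col3  = rnorm(12, mean = 300))

result <- toy %>%
  # ID-independent column, computed once for the whole table
  mutate(Cold = (Col1 + Col2 + Col3) * (var1 * var2)) %>%
  # group once for all IDs instead of filtering per ID
  group_by(Group) %>%
  summarise(SumCol1   = sum(Col1),
            SumCol2   = sum(Col2),
            SumCol3   = sum(Col3),
            SumColAll = sum(Cold),
            .groups   = "drop")

result
```

The result has one row per group, which is what binding the per-ID results together would produce.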

To make this more explicit, here is a data.table-only solution:

library(data.table)
library(stringi)

# Fake data
set.seed(42)
var1 <- 24
var2 <- 76

DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
                                                 pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))
setkey(DT, Group)  # key DT here; df does not exist until the next statement

df <- DT[order(Group)][
  , .(Month = seq(1, 180, 1),
      Col1 = rnorm(180, mean = 500, sd = 1), 
      Col2 = rnorm(180, mean = 1000, sd = 1), 
      Col3 = rnorm(180, mean = 300, sd = 1)), 
  by = Group
  ][, c("Cola", "Colb", "Colc") := .(Col1 * (var1 * var2), 
                                     Col2 * (var1 * var2),
                                     Col3 * (var1 * var2))
    ][, Cold := Cola + Colb + Colc]


# aggregation
df[, .(SumCol1 = sum(Col1),
       SumCol2 = sum(Col2),
       SumCol3 = sum(Col3),
       SumColAll = sum(Cold)), by = Group]

I am adding the computed columns by reference. The aggregation step uses the grouping functionality provided by data.table. In case your aggregation is more complicated, you can also use a function:

# aggregation function
mySum <- function(Col1, Col2, Col3, Cold) {
  list(SumCol1 = sum(Col1),
       SumCol2 = sum(Col2),
       SumCol3 = sum(Col3),
       SumColAll = sum(Cold))
}

df[, mySum(Col1, Col2, Col3, Cold), by = Group]
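
If the function needs scalar inputs such as var1 and var2, they can be passed as ordinary arguments. The sketch below is one possible variant, not the answer's own code: `mySum2` and the toy table `toy` are hypothetical names, and the grouping column is renamed in `by` so the result has the layout the question asks for.

```r
library(data.table)

# Hypothetical toy table standing in for df
var1 <- 24
var2 <- 76
set.seed(42)
toy <- data.table(Group = rep(c("A", "B"), each = 3),
                  Col1  = rnorm(6, mean = 500),
                  Col2  = rnorm(6, mean = 1000),
                  Col3  = rnorm(6, mean = 300))

# Hypothetical aggregation function that takes the scalars as arguments
# and computes the combined column per group instead of storing it
mySum2 <- function(Col1, Col2, Col3, var1, var2) {
  list(SumCol1   = sum(Col1),
       SumCol2   = sum(Col2),
       SumCol3   = sum(Col3),
       SumColAll = sum((Col1 + Col2 + Col3) * (var1 * var2)))
}

# var1 and var2 are not columns, so they are looked up in the calling scope;
# naming the `by` expression renames the grouping column in the result
res <- toy[, mySum2(Col1, Col2, Col3, var1, var2), by = .(Group_ID = Group)]
res
```

This avoids materializing Cola, Colb, Colc and Cold as columns, which may matter when the table is already close to the memory limit.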

And if the aggregation might be faster when using C++ (not the case for things like sum!), you can even use that:

# aggregation function in C++
Rcpp::cppFunction('
Rcpp::List mySum(Rcpp::NumericVector Col1, 
                 Rcpp::NumericVector Col2, 
                 Rcpp::NumericVector Col3, 
                 Rcpp::NumericVector Cold) {
    double SumCol1 = Rcpp::sum(Col1);
    double SumCol2 = Rcpp::sum(Col2);
    double SumCol3 = Rcpp::sum(Col3);
    double SumColAll = Rcpp::sum(Cold);             
    return Rcpp::List::create(Rcpp::Named("SumCol1") = SumCol1,
                              Rcpp::Named("SumCol2") = SumCol2,
                              Rcpp::Named("SumCol3") = SumCol3,
                              Rcpp::Named("SumColAll") = SumColAll);
}
')

df[, mySum(Col1, Col2, Col3, Cold), by = Group]

In all these examples the grouping and looping is left to data.table, since you won't gain anything by doing this yourself.
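
Because data.table owns the loop, the remaining knob for parallelism is its thread setting. A minimal sketch for inspecting and changing it follows. Whether a particular grouped aggregation actually runs on several threads depends on the data.table version, on the expression, and on whether the package was built with OpenMP, so treat this as something to check rather than a guaranteed speed-up.

```r
library(data.table)

# Number of threads data.table currently uses for its OpenMP-parallel code
getDTthreads()

# Optionally change it; setDTthreads() invisibly returns the previous setting
old <- setDTthreads(2)
getDTthreads()

# Restore the previous setting
setDTthreads(old)
```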
