Group vector on conditional sum


Question

I want to group a vector based on the sum of the elements being less than or equal to n. Assume the following,

set.seed(1)
x <- sample(10, 20, replace = TRUE)
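#[1]  3  4  6 10  3  9 10  7  7  1  3  2  7  4  8  5  8 10  4  8
# (output from R < 3.6.0; on R >= 3.6.0, call RNGkind(sample.kind = "Rounding") before set.seed() to reproduce it)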

#Where,
n = 15

The expected output would be to group values while their sum is <= 15, i.e.

y <- c(1, 1, 1, 2, 2, 3, 4, 5 ,5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)

As you can see the sum is never greater than 15,

sapply(split(x, y), sum)
# 1  2  3  4  5  6  7  8  9 10 
#13 13  9 10 15 12 12 13 14  8 

NOTE: I will be running this on huge datasets (usually > 150 - 200GB) so efficiency is a must.

A method that I tried, which comes close but fails, is:

as.integer(cut(cumsum(x), breaks = seq(0, max(cumsum(x)) + 15, 15)))
#[1] 1 1 1 2 2 3 3 4 4 4 5 5 5 6 6 6 7 8 8 8
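
To see why this attempt falls short, it may help to look at the group sums it produces. The sketch below hard-codes the vector shown in the question (so it does not depend on the R version's sampling behaviour); the names `cs`, `g_cut` and `sums_cut` are only illustrative. The breaks sit at fixed multiples of 15 on the cumulative sum, so the running total is never reset when a new group starts, and some groups end up over the limit.

```r
# Vector from the question, hard-coded for reproducibility
x <- c(3, 4, 6, 10, 3, 9, 10, 7, 7, 1, 3, 2, 7, 4, 8, 5, 8, 10, 4, 8)
n <- 15

# The attempted approach: bin the cumulative sum at fixed multiples of n
cs <- cumsum(x)
g_cut <- as.integer(cut(cs, breaks = seq(0, max(cs) + n, n)))

# Sum of the original values within each resulting group
sums_cut <- tapply(x, g_cut, sum)
sums_cut
# group sums: 13 13 19 15 12 17 8 22  (groups 3, 6 and 8 exceed 15)
```
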

Recommended answer

Here is my Rcpp solution (close to Khashaa's solution, but a bit shorter/stripped down). Since you said speed is important, Rcpp is probably the way to go:

# create the data
set.seed(1)
x <- sample(10, 20, replace = TRUE)
y <- c(1, 1, 1, 2, 2, 3, 4, 5 ,5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)

# create the Rcpp function
library(Rcpp)
cppFunction('
IntegerVector sotosGroup(NumericVector x, int cutoff) {
 IntegerVector groupVec (x.size());
 int group = 1;
 double runSum = 0;
 for (int i = 0; i < x.size(); i++) {
  runSum += x[i];
  if (runSum > cutoff) {
   group++;
   runSum = x[i];
  }
  groupVec[i] = group;
 }
 return groupVec;
}
')

# use the function as usual
y_cpp <- sotosGroup(x, 15)
sapply(split(x, y_cpp), sum)
#>  1  2  3  4  5  6  7  8  9 10 
#> 13 13  9 10 15 12 12 13 14  8


all.equal(y, y_cpp)
#> [1] TRUE
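
For readers without a C++ toolchain, the same greedy logic can be sketched in plain R. This is only a reference version for checking results on small inputs, not a replacement for the Rcpp function, and it is expected to be much slower on large vectors. The function name `group_by_running_sum` is hypothetical, and the input is hard-coded from the question.

```r
# Plain-R sketch of the same greedy grouping (hypothetical helper name)
group_by_running_sum <- function(x, cutoff) {
  groups <- integer(length(x))
  group <- 1L
  run_sum <- 0
  for (i in seq_along(x)) {
    run_sum <- run_sum + x[i]
    if (run_sum > cutoff) {
      # Adding x[i] would exceed the cutoff: start a new group at x[i]
      group <- group + 1L
      run_sum <- x[i]
    }
    groups[i] <- group
  }
  groups
}

x <- c(3, 4, 6, 10, 3, 9, 10, 7, 7, 1, 3, 2, 7, 4, 8, 5, 8, 10, 4, 8)
y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)

y_r <- group_by_running_sum(x, 15)
all.equal(y, as.numeric(y_r))
```

Note that, as in the Rcpp version, a single element larger than the cutoff still gets a group of its own, so that group's sum exceeds the cutoff; if the very first element is larger than the cutoff, numbering starts at 2.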

In case anyone needs to be convinced by the speed:

# Speed Benchmarks
library(data.table)
library(microbenchmark)
dt <- data.table(x)

frank <- function(DT, n = 15) {
 DT[, xc := cumsum(x)]
 b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]
 z = 1; res = z
 while (!is.na(z)) 
  res <- c(res, z <- b[z])
 DT[, g := cumsum(.I %in% res)][]
}

microbenchmark(
 frank(dt),
 sotosGroup(x, 15),
 times = 100
)
#> Unit: microseconds
#>               expr      min       lq       mean    median       uq       max neval cld
#>          frank(dt) 1720.589 1831.320 2148.83096 1878.0725 1981.576 13728.830   100   b
#>  sotosGroup(x, 15)    2.595    3.962    6.47038    7.5035    8.290    11.579   100  a
