Running Rounding


Problem description


I am trying to implement rounding over a column in a way that running sum of rounded values matches the running sum of original values within a group.

Sample data for the task has three columns:

  • numbers - values that I need to round;
  • ids - define order of values, can be date for time series data;
  • group - defines the group within which I need to round the numbers.

Here is a data sample, already ordered by ids within a group:

       numbers  ids group
       35.07209 1   1
       27.50931 2   1
       70.62019 3   1
       99.55451 6   1
       34.40472 8   1
       17.58864 10  1
       93.66178 4   3
       83.21700 5   3
       63.89058 7   3
       88.96561 9   3

To generate sample data for testing I use this code:

  # Make data sample.
  x.size <- 10^6
  x <- list("numbers" = runif(x.size) * 100, "ids" = 1:x.size, "group" = ifelse(runif(x.size) > 0.2 ,1, ifelse(runif(x.size) > 0.8, 2, 3)))
  x <- data.frame(x)
  x <- x[order(x$group), ]

I wrote a function that keeps the state of rounding within a group, to make sure that the total value of round values is correct:

makeRunRound <- function() {
  # Data must be sorted by id.
  cumDiff <- 0
  savedId <- 0

  function(x, id) {
  # id here represents the group.

    if(id != savedId) {
      cumDiff <<- 0
      savedId <<- id
    }

    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff

    if(cumDiff > 1) {
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    return (xInt)
  }
}

runRound <- makeRunRound()

This approach works, and I would be happy with it if not for the speed.

It takes 2-3 seconds to complete the running rounding on a 1M-record sample.

This is too long for me, and there is another way, explained in this question, which works six times faster. I keep the code as given in the answer by josliber:

smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x-y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}
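To see what smartRound guarantees (and what it does not), here is a small self-contained check with made-up values: the grand total is preserved, but which items get rounded up depends on the rank of their fractional parts, not on their position in the sequence.

```r
# smartRound as defined above, repeated so this snippet runs standalone.
smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}

v <- c(0.4, 0.7, 0.9)   # made-up values; sum is exactly 2.0
s <- smartRound(v)      # the two largest fractional parts get rounded up
# s is c(0, 1, 1): grand total 2 matches round(sum(v))
stopifnot(sum(s) == round(sum(v)))
```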

Using the sample data generated by the code above (the benchmark also assumes a data.table copy of it, x.dt <- data.table::as.data.table(x)), benchmarking:

# Code to benchmark speed.
library(microbenchmark)
res <- microbenchmark(
  "run.df" = x$mrounded <- mapply(FUN=runRound, x$numbers, x$group),
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  "smart.df" = x$smart.round <- smartRound(x$numbers),
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "silly" = x$silly.round <- round(x$numbers),
  times = 50
)
print(res)
boxplot(res)

This produces the following results:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
   run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679    50
   run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113    50
 smart.df  488.70854  537.03179  576.57704  567.63077  611.81271  861.76436    50
 smart.dt  390.35646  414.96749  468.95317  457.85820  507.54395  631.17081    50
    silly   13.72486   15.82744   19.41796   17.19057   18.85385   88.06329    50

So the speed changes from 20 ms for cell-level rounding to 2.6 s for the method that respects the running total of rounded values within the group.

I have included a comparison of the calculations based on data.frame and data.table to demonstrate that there is no major difference, even though data.table slightly improves performance.

I really appreciate the simplicity and speed of smartRound, but it does not respect the order of the items, hence the result will differ from what I need.

Is there a way to:

  • either, modify smartRound in a way that will achieve the same results as runRound without losing performance?
  • or, modify runRound to improve performance?
  • or, is there another, better solution altogether?

EDIT:

The answer by dww gives the fastest solution:

diffRound <- function(x) { 
  diff(c(0, round(cumsum(x)))) 
}
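diffRound is easy to sanity-check in isolation (the values below are illustrative, not from the question's sample): because each entry of cumsum(x) is rounded directly, every running total of the output stays within 0.5 of the exact running total.

```r
diffRound <- function(x) {
  diff(c(0, round(cumsum(x))))
}

v <- c(35.07209, 27.50931, 70.62019)  # illustrative values
r <- diffRound(v)                     # 35 28 70
stopifnot(all(abs(cumsum(r) - cumsum(v)) <= 0.5))
```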

I have reduced the test to four options:

res <- microbenchmark(
  "silly" = x$silly.round <- round(x$numbers),
  "diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)] ,
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  times = 50
)

New results:

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval
     silly   14.67823   16.64882   17.31416   16.83338   17.67497   22.48689    50
 diff(dww)   54.57762   70.11553   76.67135   71.37325   76.83717  139.18745    50
  smart.dt  392.83240  408.65768  456.46592  441.33212  492.67824  592.57723    50
    run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005    50

Thanks to dww, I have a 6x performance gain without losing precision.

Solution

I would do it this way, with simple base vectorised functions:

First calculate the running total of the original numbers, and the rounded value of that running total. Then recover a sequence of numbers that adds up to this rounded running total, using diff() to see how much each rounded cumulative sum is larger than the last.

cum.sum <- cumsum(x$numbers)
cum.sum.rounded <- round(cum.sum)
numbers.round <- diff(cum.sum.rounded)
numbers.round <- c(cum.sum.rounded[1], numbers.round)

Check that all is as you want it:

check.cs <- cumsum(numbers.round)
all(abs(check.cs - cum.sum) <= 1)
# TRUE
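The snippet above ignores the grouping from the question. A minimal base-R sketch of applying the same diff-of-rounded-cumsums idea per group (assuming, as in the question, that rows are already ordered by ids within each group) could use ave():

```r
diffRound <- function(x) diff(c(0, round(cumsum(x))))

# Tiny illustrative frame, already ordered by ids within each group.
x <- data.frame(
  numbers = c(35.07209, 27.50931, 93.66178, 83.21700),
  group   = c(1, 1, 3, 3)
)
x$rounded <- ave(x$numbers, x$group, FUN = diffRound)

# Within each group, the running totals of rounded and original values
# never drift apart by more than 0.5.
drift <- abs(ave(x$rounded, x$group, FUN = cumsum) -
             ave(x$numbers, x$group, FUN = cumsum))
stopifnot(all(drift <= 0.5))
```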
