Getting same output as cut() using speedier hist() or findInterval()?

Question

I read this article http://www.r-bloggers.com/comparing-hist-and-cut-r-functions/ and found hist() to be about 4 times faster than cut() on my PC. My script loops through cut() many times, so the time saved would be significant. I therefore tried to switch to the speedier function, but I am having difficulty reproducing the exact output of cut().

From the sample code below:

data <- rnorm(10, mean=0, sd=1)  #generate data
my_breaks <- seq(-6, 6, by=1)  #create a vector that specifies my break points
cut(data, breaks=my_breaks)

I wish to get a vector comprising levels that each element of data is assigned to using my breakpoints, i.e. the exact output of cut:

 [1] (1,2]   (-1,0]  (0,1]   (1,2]   (0,1]   (-1,0]  (-1,0]  (0,1]   (-2,-1] (0,1]  
Levels: (-6,-5] (-5,-4] (-4,-3] (-3,-2] (-2,-1] (-1,0] (0,1] (1,2] (2,3] (3,4] (4,5] (5,6]

My question: How do I use elements of the hist() output (i.e. breaks, counts, density, mids, etc) or findInterval to reach my objective?

Separately, I found an example from https://stackoverflow.com/questions/12379128/r-switch-statement-on-comparisons using findInterval, but this requires me to create the interval labels beforehand, which is not what I want.

Any help would be appreciated. Thanks in advance!

Recommended answer

Here is an implementation based on your findInterval suggestion which is 5-6 times faster than classical cut:

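cut2 <- function(x, breaks) {
  # Build "(a,b]" labels from each pair of consecutive break points
  labels <- paste0("(", breaks[-length(breaks)], ",", breaks[-1L], "]")
  # findInterval() gives the interval index of each x; use it to pick the label
  return(factor(labels[findInterval(x, breaks)], levels=labels))
}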

library(microbenchmark)

set.seed(1)
data <- rnorm(1e4, mean=0, sd=1)
my_breaks <- seq(-6, 6, by=1)  # same break points as in the question

microbenchmark(cut.default(data, my_breaks), cut2(data, my_breaks))

# Unit: microseconds
#                         expr      min        lq    median        uq      max neval
# cut.default(data, my_breaks) 3011.932 3031.1705 3046.5245 3075.3085 4119.147   100
#        cut2(data, my_breaks)  453.761  459.8045  464.0755  469.4605 1462.020   100

identical(cut(data, my_breaks), cut2(data, my_breaks))
# TRUE
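
One caveat about the check above: findInterval() treats intervals as closed on the left by default, while cut() closes them on the right by default, so the two only agree as long as no value lands exactly on a break point or outside the range of the breaks. Below is a minimal sketch of a variant that tries to follow cut()'s defaults more closely. The name cut3 is hypothetical, and it assumes a version of R where findInterval() accepts left.open (3.3.0 onward, as far as I know). Like cut2, it builds labels with paste0(), which matches cut() for integer breaks such as those in the question but may differ when cut() would round the break values.

```r
# Hypothetical helper (not part of the original answer): a findInterval()-based
# variant that uses right-closed intervals (a, b], like cut()'s default.
cut3 <- function(x, breaks) {
  breaks <- sort(breaks)
  n <- length(breaks)
  labels <- paste0("(", breaks[-n], ",", breaks[-1L], "]")
  # left.open = TRUE switches findInterval() to intervals of the form (a, b]
  idx <- findInterval(x, breaks, left.open = TRUE)
  # Index 0 means x <= first break and index n means x > last break;
  # neither belongs to any interval, so map both to NA as cut() does
  idx[which(idx < 1L | idx >= n)] <- NA_integer_
  factor(labels[idx], levels = labels)
}

# Values on a break point, out of range, and missing
x <- c(-6, -5.5, 0, 0.5, 6, 7, NA)
my_breaks <- seq(-6, 6, by = 1)
identical(cut(x, my_breaks), cut3(x, my_breaks))
```

With these inputs, 0 is assigned to (-1,0] rather than (0,1], and -6, 7 and NA all become NA. The extra bookkeeping costs a little time, so it is worth benchmarking again before swapping it into a tight loop.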
