Count number of rows within certain range


Problem description


I have a data.table with some values ('value'), lower limits ('min_val') and upper limits ('max_val'):

   |  value | min_val | max_val |
1: | 94.001 |   94.00 |   94.02 |
2: | 94.002 |   94.00 |   94.03 |
3: | 94.003 |   94.01 |   94.04 |
4: | 95     |   94.98 |   95.02 |

For each row, I want to count how many values from the whole dt satisfy value > min_val & value < max_val.

   |  value | min_val | max_val | count |
1: | 94.001 |   94.00 |   94.02 |     1 |   # (number of values > 94.00 & < 94.02)
2: | 94.002 |   94.00 |   94.03 |     2 |
3: | 94.003 |   94.01 |   94.04 |     2 |
4: | 95     |   94.98 |   95.02 |     1 |

I've tried dt[, count := nrow(dt[value > dt[min_val] & value < dt[max_val]])] but I'm on the wrong path.

Solution

The OP has disclosed that his production dataset consists of 81 million rows. Unfortunately, r2evans' benchmark used only the 4-row sample dataset provided with the question and neglected henrik's suggestion. To find the best approach for a large dataset like the OP's, I find it worthwhile to run benchmarks with varying and realistic problem sizes in order to measure run time as well as memory consumption.

Run time and memory consumption may depend on

  1. the number of values,
  2. the number of intervals,
  3. and the number of values which fall into each interval.

Items 1 and 2 are linked, as both the number of values and the number of intervals are given by the number of rows in the dataset. So, there are two size parameters we can vary: the number of rows n and the number of values within each interval m. In order to have reproducible and predictable benchmark data, the benchmark datasets are constructed by

d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]

In the case of n <- 4 and m <- 1, d becomes

   value min_val max_val count
   <num>   <num>   <num> <int>
1:     1       0       2    -1
2:     2       1       3    -1
3:     3       2       4    -1
4:     4       3       5    -1

In order to create equal conditions for each benchmark run, the count column is pre-allocated with some dummy data.

The benchmark includes henrik's aggregation in a non-equi self join, the three variants posted by r2evans, Thomas' approach, and Alexandre's approach (see the mark() calls below).

Edit: A 3rd set of benchmark runs compares henrik's approach with chinsoon12's Rcpp solution (see below).

Unfortunately, TarJae's answer did not work for me.

library(data.table)
library(bench)
library(ggplot2)
options(datatable.print.class = TRUE)

bm1 <- press(
  n = 2^c(2, 10, 13),
  m = 2^c(0, 9, 12),
  {
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[ , count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      r2evans0 = {
        d[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))]
      },
      r2evans1 = {
        d[, count := mapply(function(mi,ma) sum(mi < value & value < ma), min_val, max_val)]
      },
      r2evans2 = {
        d[, count := rowSums(outer(min_val, d$value, `<`) &
                               outer(max_val, d$value, `>`)),
          by = .(g = seq_len(nrow(d)) %/% 100)]
      },
      Thomas = {
        d[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))]
      },
      Alexandre = {
        d[, count := lapply(
          # seq.int(1, nrow(d)),
          seq_len(nrow(d)),
          function(i) sum(d[, value] > d[i, min_val] & d[, value] < d[i, max_val])
        )]
      },
      min_iterations = 3
    )
  }
)

autoplot(bm1)

Please, note the logarithmic time scale.

The chart exhibits that

  • Alexandre's approach is up to an order of magnitude slower than any of the other solutions (and may be omitted from further runs),
  • with increasing number of rows n, henrik's approach becomes neck and neck with r2evans1 (worth investigating further),
  • the number of values in each interval m seems to have no or little effect on run time.

The latter can be verified by changing the facets and plotting the median times for different m in one facet:

ggplot(bm1) +
  aes(x = median, y = expression, color = as.factor(m)) +
  geom_point() + 
  facet_wrap(vars(n))

In the next chart below, plotting mem_alloc instead of median times exhibits that

  • m has no effect on memory allocation (with one exception),
  • for large n, henrik's approach needs orders of magnitude less memory than any other approach:

Please, note the log scale.
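
The plotting code for this mem_alloc chart is not shown in the answer; a minimal sketch, assuming the same layout as the facet plot above but with memory allocation on the x axis (converted to plain numbers of bytes for plotting):

ggplot(bm1) +
  aes(x = as.numeric(mem_alloc), y = expression, color = as.factor(m)) +  # bytes
  geom_point() +
  scale_x_continuous(trans = "log10") +
  facet_wrap(vars(n))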

2nd set of benchmark runs

Based on the previous results, the next set of benchmark runs varies only the size parameter n while m is kept constant. Alexandre's approach is omitted as it is too slow.

n is varied from 2^10 (1024) to 2^14 (16384) with m = 1.0. Unfortunately, the run with n = 2^15 was aborted due to lack of memory on my PC.
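
The code that produced bm2 is not shown in the answer; a minimal sketch of how it could be set up, following the pattern of bm1 with m fixed at 1.0 and Alexandre's approach dropped (the exact set of competitors kept here is an assumption):

bm2 <- press(
  n = 2^(10:14),
  {
    m <- 1.0  # kept constant
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[ , count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      r2evans0 = {
        d[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))]
      },
      r2evans1 = {
        d[, count := mapply(function(mi, ma) sum(mi < value & value < ma), min_val, max_val)]
      },
      r2evans2 = {
        d[, count := rowSums(outer(min_val, d$value, `<`) &
                               outer(max_val, d$value, `>`)),
          by = .(g = seq_len(nrow(d)) %/% 100)]
      },
      Thomas = {
        d[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))]
      },
      min_iterations = 3
    )
  }
)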

autoplot(bm2)

henrik's approach has taken the lead in terms of speed for the 2^14 (16384) rows case.

To identify if this indicates a trend, run time vs problem size n is plotted by

ggplot(bm2) + 
  aes(x = n, y = median, color = expression, group = attr(expression, "description"), 
      label = ifelse(n == max(n), attr(expression, "description"), "")) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(trans = scales::log2_trans(), expand = expansion(mult = c(0.05, 0.1))) + 
  ggrepel::geom_label_repel(direction = "y", nudge_x = 1000) 

henrik's approach seems to have a high overhead but gains a speed advantage with increasing problem size.

Also with respect to memory allocation, henrik's approach of aggregating in a non-equi self join seems to need substantially less memory than the other approaches. More importantly, memory allocation increases less steeply with problem size, which indicates that this approach can handle much larger problem sizes when available computer memory is a limiting factor.
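
As a brief illustration of what that non-equi self join computes, here is the join expression from the benchmark code applied to the n = 4, m = 1 table constructed above:

d <- data.table(value = as.double(1:4))[, min_val := value - 1][, max_val := value + 1][, count := -1L]
# For each row of the inner d (i), the join picks the rows of the outer d (x)
# with x.value > i.min_val and x.value < i.max_val; by = .EACHI aggregates per
# i-row, so .N is the number of matching values, returned in the original row order.
d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N
# [1] 1 1 1 1   (with m = 1 each open interval contains exactly one value)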

EDIT: 3rd set of benchmark runs

This set of benchmark runs compares henrik's aggregate in a non-equi self join with chinsoon12's new Rcpp solution.

Due to the much smaller memory footprint of both approaches the problem size can be increased up to 2^18 (262144) rows before hitting the 16 GB memory limit on my Windows PC.

library(Rcpp)
bm4 <- press(
  n = 2^(10:18),
  {
    m <- 1.
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[ , count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      chinsoon = {
        cppFunction("IntegerVector inrange(NumericVector v, NumericVector minv, NumericVector maxv) {
    int n = v.size();
    IntegerVector res(n);
    
    for (int r=0; r<n; r++) {
        for (int i=0; i<n; i++) {
            if (v[i] > minv[r] && v[i] < maxv[r]) {
                res[r]++;
            }
        }
    }
    
    return res;
}")
        d[, count := inrange(value, min_val, max_val)]
      },
      min_iterations = 3
    )
  }
)

The next two charts show the median run time and the memory allocation vs problem size, resp. (please, note the log scales):
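
The plotting code is not included in the answer; one possible sketch for the run-time chart (the memory chart would be analogous with mem_alloc instead of median; times are converted to seconds for plotting):

ggplot(bm4) +
  aes(x = n, y = as.numeric(median), color = attr(expression, "description")) +
  geom_point() +
  geom_line() +
  scale_x_continuous(trans = scales::log2_trans()) +
  scale_y_continuous(trans = "log10") +
  labs(y = "median run time [s]", color = "expression")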

Results for n = 2^18 (262144):

setDT(bm4)[n == 2^18, .(expression = unique(attr(expression, "description")), 
                        n, median, mem_alloc)]

   expression      n       median     mem_alloc
       <char>  <num> <bench_time> <bench_bytes>
1:     henrik 262144       17.47s       12.14MB
2:   chinsoon 262144        1.03m        2.06MB

Apparently, chinsoon's approach is faster for problem sizes up to 2^16 (65536) while henrik's approach is faster for larger problem sizes (and seems to have a more linear time behaviour). For problem size n = 2^18, henrik's approach is almost 4 times faster than chinsoon's.

On the other hand, henrik's approach allocates much more memory than chinsoon's. For problem size n = 2^18, henrik's approach allocates about 6 times more memory than chinsoon's. Apparently, this ratio is constant for increasing problem size.

So, there is a tradeoff between speed (henrik's approach) and memory requirement (chinsoon's approach) depending on problem size. Your mileage may vary.
