Efficiently get indices of histogram bins in Python


Question

I have a large 10000x10000-element image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.

How do I extract the indices of each bin to efficiently perform my calculation using the bin values?

What I am looking for is a solution that avoids the bottleneck of having to select ind == j from my large array every time. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?

One way to achieve what I need is to use code like the following (see e.g. this related answer), where I digitize my values and then have a j-loop selecting the digitized indices equal to j, like below:

import numpy as np

# This function func() is just a placeholder for a much more complicated function.
# I am aware that my problem could easily be sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(int(1e8))  # the size argument must be an integer
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

# np.digitize labels in-range values 1..nbins, so loop up to nbins inclusive
result = [func(vals[ind == j]) for j in range(1, nbins+1)]

This is not what I want, as it selects ind == j from my large array every time. This makes the solution very inefficient and slow.

The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic for the general case of a user-defined function. Using Scipy directly, an identical output can be obtained with the following:

import numpy as np
from scipy.stats import binned_statistic  # note: binned_statistic, not binned_statistics

# func() as defined in the first example above
vals = np.random.random(int(1e8))
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]

3. Using labeled_comprehension

Another Scipy alternative is to use scipy.ndimage.measurements.labeled_comprehension. Using that function, the above example would become:

import numpy as np
from scipy.ndimage import labeled_comprehension

# func() as defined in the first example above
vals = np.random.random(int(1e8))
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

# labels run 1..nbins, so include the last bin in the index array
result = labeled_comprehension(vals, ind, np.arange(1, nbins+1), func, float, 0)

Unfortunately this form is also inefficient and, in particular, it has no speed advantage over my original example.

To further clarify, what I am looking for is functionality equivalent to the REVERSE_INDICES keyword of the HISTOGRAM function in the IDL language. Can this very useful functionality be efficiently replicated in Python?

Specifically, using the IDL language, the above example could be written as:

vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)

for j=0, nbins-1 do begin
    jbins = r[r[j]:r[j+1]-1]  ; Selects indices of bin j
    result[j] = func(vals[jbins])
endfor

The above IDL implementation is about 10 times faster than the NumPy one, because the indices of the bins do not have to be selected anew for every bin. And the speed difference in favour of the IDL implementation increases with the number of bins.
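For reference, REVERSE_INDICES can be emulated in NumPy with a single sort of the bin labels: the sorted index array plus the cumulative bin counts play the same role as IDL's r vector. This is a sketch of one possible emulation (my own construction, not part of the original question), where the helper name reverse_indices is hypothetical:

```python
import numpy as np

# Sketch of a NumPy emulation of IDL's REVERSE_INDICES: sort the bin labels
# once, then use cumulative counts as offsets into the sorted index array,
# so each bin's element indices are a single contiguous slice.
def reverse_indices(vals, nbins, lo=0.0, hi=1.0):
    ind = np.digitize(vals, np.linspace(lo, hi, nbins + 1))  # labels 1..nbins
    order = np.argsort(ind, kind='stable')     # element indices grouped by bin
    counts = np.bincount(ind, minlength=nbins + 1)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    # order[offsets[j]:offsets[j+1]] are the indices of the elements in bin j
    return order, offsets

np.random.seed(0)
vals = np.random.random(10**6)
nbins = 100
order, offsets = reverse_indices(vals, nbins)

j = 1  # first bin, i.e. values in [0, 1/nbins)
jbins = order[offsets[j]:offsets[j + 1]]
assert np.all(vals[jbins] < 1.0 / nbins)
```

As in the IDL version, each bin is then extracted with one cheap slice instead of a full boolean scan of the large array.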

Answer

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but can be orders of magnitude faster:

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''

    N = len(values)
    r0, r1 = range

    # Compute equal-width bin labels directly, avoiding np.digitize
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    # The CSR constructor groups the data by row, i.e. by bin label
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    # S.indptr delimits each row's slice of S.data, i.e. each bin's values
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it does not exploit the fact that all bins are of equal width, and is therefore slow; but the method I used instead may not handle all edge cases perfectly.
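As a quick sanity check (my own addition, not part of the answer), the sparse-matrix version can be compared against the straightforward digitize loop on a smaller array; both should produce the same per-bin results:

```python
import numpy as np
from scipy.sparse import csr_matrix

# The answer's function, restated so this check runs standalone.
def binned_statistic(x, values, func, nbins, range):
    N = len(values)
    r0, r1 = range
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

np.random.seed(0)
vals = np.random.random(10**5)  # smaller array so the check finishes quickly
nbins = 100

fast = binned_statistic(vals, vals, np.sum, nbins, [0, 1])

# Reference: the slow digitize loop from the question
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)
slow = [np.sum(vals[ind == j]) for j in range(1, nbins + 1)]

assert np.allclose(fast, slow)
```

The per-bin sums agree; the only behavioral difference to watch is how values falling exactly on a bin edge are labeled, since the integer truncation and np.digitize can round such values differently.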
