Efficiently get indices of histogram bins in Python



Short Question

I have a large 10000x10000 elements image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.

How do I extract the indices of each bin to efficiently perform my calculation using the bins values?

What I am looking for is a solution which avoids the bottleneck of having to select every time ind == j from my large array. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?

Detailed Explanation

1. Straightforward Solution

One way to achieve what I need is to use code like the following (see e.g. this related answer), where I digitize my values and then have a j-loop selecting the digitized indices equal to j, like below

import numpy as np

# This function func() is just a placeholder for a much more complicated function.
# I am aware that my problem could easily be sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(10**8)  # np.random.random requires an integer size
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

result = [func(vals[ind == j]) for j in range(1, nbins + 1)]

This is not what I want as it selects every time ind == j from my large array. This makes this solution very inefficient and slow.
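To make the cost concrete: each `ind == j` comparison scans the full array, so the loop is O(nbins × N). As a point of comparison (a sketch that is not part of the original question), a single stable argsort by bin index groups the data once, after which each bin is a contiguous slice:

```python
import numpy as np

def func(x):
    return np.sum(x)

vals = np.random.random(10**6)   # smaller array for illustration
nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)    # bin indices in 1..nbins for vals in [0, 1)

# Sort the values by bin index once; each bin is then a contiguous chunk.
sorted_vals = vals[np.argsort(ind, kind='stable')]
counts = np.bincount(ind, minlength=nbins + 1)[1:]   # counts of bins 1..nbins
groups = np.split(sorted_vals, np.cumsum(counts)[:-1])
result = [func(g) for g in groups]
```

This visits the large array only a constant number of times, regardless of `nbins`.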

2. Using binned_statistic

The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic for the general case of a user-defined function. Using Scipy directly, an identical output can be obtained with the following

import numpy as np
from scipy.stats import binned_statistic

vals = np.random.random(10**8)

# func is the same placeholder function defined in the first example
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]

3. Using labeled_comprehension

Another Scipy alternative is to use scipy.ndimage.labeled_comprehension (the scipy.ndimage.measurements namespace is deprecated). Using that function, the above example would become

import numpy as np
from scipy.ndimage import labeled_comprehension

vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)

# func is the same placeholder function defined in the first example
result = labeled_comprehension(vals, ind, np.arange(1, nbins + 1), func, float, 0)

Unfortunately, this form is also inefficient and, in particular, has no speed advantage over my original example.

4. Comparison with IDL language

To further clarify, what I am looking for is functionality equivalent to the REVERSE_INDICES keyword of the HISTOGRAM function in the IDL language. Can this very useful functionality be efficiently replicated in Python?

Specifically, using the IDL language the above example could be written as

vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)

for j=0, nbins-1 do begin
    jbins = r[r[j]:r[j+1]-1]  ; Selects indices of bin j
    result[j] = func(vals[jbins])
endfor

The above IDL implementation is about 10 times faster than the Numpy one, due to the fact that the indices of the bins do not have to be selected for every bin. And the speed difference in favour of the IDL implementation increases with the number of bins.
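A rough NumPy analogue of REVERSE_INDICES can be sketched with one argsort (this sketch is mine, not from the original post; the names `reverse_indices`, `order`, and `bounds` are hypothetical). `order[bounds[j]:bounds[j+1]]` then plays the role of `r[r[j]:r[j+1]-1]`:

```python
import numpy as np

def reverse_indices(vals, nbins, lo=0.0, hi=1.0):
    """Return (order, bounds): the indices of bin j are order[bounds[j]:bounds[j+1]]."""
    # Direct bin computation, assuming equal-width bins on [lo, hi]
    ind = (nbins / (hi - lo) * (vals - lo)).astype(int)
    ind = np.clip(ind, 0, nbins - 1)              # guard the hi edge (vals == hi)
    order = np.argsort(ind, kind='stable')        # element indices grouped by bin
    counts = np.bincount(ind, minlength=nbins)
    bounds = np.concatenate(([0], np.cumsum(counts)))
    return order, bounds

vals = np.random.random(10**6)
order, bounds = reverse_indices(vals, 100)
jbins = order[bounds[3]:bounds[4]]   # indices of bin 3, as with r[r[j]:r[j+1]-1] in IDL
```

As in IDL, the sort is paid once up front, and selecting any bin afterwards is a cheap slice.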

Solution

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but can be orders of magnitude faster.

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''

    N = len(values)
    r0, r1 = range

    # Direct computation of the bin index, exploiting the equal bin widths
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    # Abuse the CSR constructor: it stores the values of each bin (row)
    # contiguously in S.data, with the row boundaries in S.indptr
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't exploit the fact that all bins have equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
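For instance, the function above can be exercised as follows, with np.sum standing in for the user function (the definition is repeated so the snippet runs standalone; the cross-check against a mask-based selection is mine, not part of the original answer):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Repeated from above so this usage sketch is self-contained
def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''
    N = len(values)
    r0, r1 = range
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

vals = np.random.random(10**6)
result = binned_statistic(vals, vals, np.sum, 100, [0, 1])

# Cross-check the first bin against the naive mask-based selection
assert np.isclose(result[0], vals[vals < 0.01].sum())
```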
