Extending numpy.digitize to multi-dimensional data


Problem description


I have a set of large arrays (about 6 million elements each) that I want to basically perform a np.digitize but over multiple axes. I am looking for some suggestions on both how to effectively do this but also on how to store the results.


I need all the indices (or all the values, or a mask) of array A where the values of array B are in a range and the values of array C are in another range and D in yet another. I want either the values, indices, or mask so that I can do some as of yet undecided statistics on the values of the A array in each bin. I will also need the number of elements in each bin but len() can do that.
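For context, on a single axis np.digitize maps each value to the index of the bin edge it falls under; this small example is my own illustration, not from the post:

```python
import numpy as np

values = np.array([0.5, 2.3, 4.8, 9.1])
edges = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])

# np.digitize returns i such that edges[i-1] <= x < edges[i]
idx = np.digitize(values, edges)
print(idx)  # -> [1 2 3 5]
```

The question is essentially how to do this along B, C, and D at once and collect the matching A values per (B, C, D) bin.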


Here is one example I worked up that seems reasonable:

import itertools
import numpy as np

# note: current numpy requires an integer size, not 1e4
A = np.random.random_sample(10_000)
B = (np.random.random_sample(10_000) + 10)*20
C = (np.random.random_sample(10_000) + 20)*40
D = (np.random.random_sample(10_000) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx), 
                                    np.unique(C_Cidx), 
                                    np.unique(D_Didx)):
    # note: np.bitwise_and only ANDs two arrays -- the third argument is
    # treated as `out`, so this does not actually combine all three masks
    a_bins.append([(bb, cc, dd), [A[np.bitwise_and((B_Bidx==bb),
                                                   (C_Cidx==cc),
                                                   (D_Didx==dd))]]])


This however makes me nervous that I will run out of memory on large arrays.

I could also do something like this:

b_inds = np.empty((len(A), 10), dtype=bool)
c_inds = np.empty((len(A), 12), dtype=bool)
d_inds = np.empty((len(A), 24), dtype=bool)
for i in range(10):
    b_inds[:,i] = B_Bidx == i
for i in range(12):
    c_inds[:,i] = C_Cidx == i
for i in range(24):
    d_inds[:,i] = D_Didx == i
# get the A data for the 1,2,3 B,C,D bin
print(A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]])


At least here the output is of known and constant size.


Does anyone have any better thoughts on how to do this smarter? Or is there anything that needs clarifying?
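One way to avoid materializing a boolean mask per bin is to collapse the three digitized index arrays into a single flat bin label per element with np.ravel_multi_index. This is an alternative sketch, not from the post; the variable names and random stand-ins are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# stand-ins for the np.digitize results on B, C, and D
b_idx = rng.integers(0, 10, n)
c_idx = rng.integers(0, 12, n)
d_idx = rng.integers(0, 24, n)

# one integer label per element instead of three boolean masks
flat = np.ravel_multi_index((b_idx, c_idx, d_idx), (10, 12, 24))

# indices of all elements falling in the (1, 2, 3) B,C,D bin
target = np.ravel_multi_index((1, 2, 3), (10, 12, 24))
members = np.nonzero(flat == target)[0]
```

Feeding `flat` to `np.bincount` (or sorting it) then gives per-bin counts while holding only one extra integer array in memory.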


Based on the answer by HYRY, this is the path I decided to take.

import numpy as np
import pandas as pd

np.random.seed(42)
A =  np.random.random_sample(10_000_000)
B = (np.random.random_sample(10_000_000) + 10)*20
C = (np.random.random_sample(10_000_000) + 20)*40
D = (np.random.random_sample(10_000_000) + 80)*80
# make the edges of the bins we want
Bbins = np.linspace(B.min(), B.max(), 9)
Cbins = np.linspace(C.min(), C.max(), 10) # note different number
Dbins = np.linspace(D.min(), D.max(), 11) # note different number
sA = pd.Series(A)
cB = pd.cut(B, Bbins, include_lowest=True)
cC = pd.cut(C, Cbins, include_lowest=True)
cD = pd.cut(D, Dbins, include_lowest=True)

# integer bin labels: .codes in current pandas (.labels in old versions)
dat = pd.DataFrame({'A':A, 'cB':cB.codes, 'cC':cC.codes, 'cD':cD.codes})
g = sA.groupby([cB.codes, cC.codes, cD.codes]).indices
# this then gives all the indices that match the group
print(g[0, 1, 2])
# this is all the array A data for that B,C,D bin
print(sA[g[0, 1, 2]])


This method seems lightning fast even for huge arrays.
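In current pandas the integer bin labels returned by pd.cut live in `.codes` rather than the old `.labels` attribute, and the "as of yet undecided statistics" can be computed per bin in one pass with `agg`. A minimal sketch on a single axis (the variable names mirror the post, but the block itself is my own illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
A = np.random.random_sample(10_000)
B = (np.random.random_sample(10_000) + 10) * 20

sA = pd.Series(A)
# bin B and group A by the integer bin code of each element
cB = pd.cut(B, np.linspace(B.min(), B.max(), 9), include_lowest=True)

# several statistics per B bin in a single pass
stats = sA.groupby(cB.codes).agg(['count', 'mean', 'std'])
```

The same call works with a list of code arrays for the full (B, C, D) grouping.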

Answer


How about using groupby in Pandas? First, fix some problems in your code:

import itertools
import numpy as np

np.random.seed(42)

A = np.random.random_sample(10_000)
B = (np.random.random_sample(10_000) + 10)*20
C = (np.random.random_sample(10_000) + 20)*40
D = (np.random.random_sample(10_000) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx), 
                                    np.unique(C_Cidx), 
                                    np.unique(D_Didx)):
    a_bins.append([(bb, cc, dd), A[(B_Bidx==bb) & (C_Cidx==cc) & (D_Didx==dd)]])

a_bins[1000]

Output:

[(4, 6, 17), array([ 0.70723863,  0.907611  ,  0.46214047])]


Here is code that returns the same result using Pandas:

import pandas as pd

cB = pd.cut(B, 9)
cC = pd.cut(C, 11)
cD = pd.cut(D, 23)

sA = pd.Series(A)
# .labels was renamed .codes in later pandas versions
g = sA.groupby([cB.codes, cC.codes, cD.codes])
g.get_group((3, 5, 16))

Output:

800     0.707239
2320    0.907611
9388    0.462140
dtype: float64


If you want to calculate some statistics for every group, you can call methods of g, for example:

g.mean()

Returns:

0  0  0     0.343566
      1     0.410979
      2     0.700007
      3     0.189936
      4     0.452566
      5     0.565330
      6     0.539565
      7     0.530867
      8     0.568120
      9     0.587762
      11    0.352453
      12    0.484903
      13    0.477969
      14    0.484328
      15    0.467357
...
8  10  8     0.559859
       9     0.570652
       10    0.656718
       11    0.353938
       12    0.628980
       13    0.372350
       14    0.404543
       15    0.387920
       16    0.742292
       17    0.530866
       18    0.389236
       19    0.628461
       20    0.387384
       21    0.541831
       22    0.573023
Length: 2250, dtype: float64

