Fast categorization (binning)
Question
I have a huge number of entries, each of which is a float. These data, x, are accessible through an iterator. I need to classify all the entries using selections like 10<y<=20, 20<y<=50, ..., where y is data from another iterable. The number of entries is much larger than the number of selections. At the end I want a dictionary like:
{ 0: [all events with 10<x<=20],
1: [all events with 20<x<=50], ... }
or something similar. For example, I'm doing:
for x, y in itertools.izip(variable_values, binning_values):
    thebin = binner_function(y)
    self.data[tuple(thebin)].append(x)
y is usually multidimensional.
This is very slow. Is there a faster solution, for example with numpy? I think the problem comes from the list.append method I'm using and not from binner_function.
Recommended answer
A fast way to get the assignments in numpy is np.digitize:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html
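As a minimal sketch of how np.digitize could map onto the selections in the question (10<y<=20, 20<y<=50), the snippet below uses right=True so that each interval is closed on the right. The edges and the sample y values are made up for illustration.

```python
import numpy as np

# Hypothetical bin edges taken from the question's selections:
# 10 < y <= 20 and 20 < y <= 50.
edges = np.array([10.0, 20.0, 50.0])
y = np.array([5.0, 10.0, 15.0, 20.0, 35.0, 50.0, 60.0])

# right=True makes each interval closed on the right:
# index i means edges[i-1] < y <= edges[i].
assign = np.digitize(y, edges, right=True)
print(assign)  # [0 0 1 1 2 2 3]
```

Index 0 collects everything at or below the first edge and index 3 everything above the last edge, so only indices 1 and 2 correspond to the two selections.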
You still have to split the resulting assignments into groups. If x or y is multidimensional, you will have to flatten the arrays first. You can then get the unique bin assignments and iterate over them with np.where to split the assignments into groups. This will probably be faster if the number of bins is much smaller than the number of elements that need to be binned.
Here is a somewhat trivial example that you will need to tweak and elaborate on for your particular problem (but it is hopefully enough to get you started with a numpy solution):
In [1]: import numpy as np
In [2]: x = np.random.normal(size=(50,))
In [3]: b = np.linspace(-20,20,50)
In [4]: assign = np.digitize(x,b)
In [5]: assign
Out[5]:
array([23, 25, 25, 25, 24, 26, 24, 26, 23, 24, 25, 23, 26, 25, 27, 25, 25,
       25, 25, 26, 26, 25, 25, 26, 24, 23, 25, 26, 26, 24, 24, 26, 27, 24,
       25, 24, 23, 23, 26, 25, 24, 25, 25, 27, 26, 25, 27, 26, 26, 24])
In [6]: uid = np.unique(assign)
In [7]: adict = {}
In [8]: for ii in uid:
   ...:     adict[ii] = np.where(assign == ii)[0]
   ...:
In [9]: adict
Out[9]:
{23: array([ 0, 8, 11, 25, 36, 37]),
24: array([ 4, 6, 9, 24, 29, 30, 33, 35, 40, 49]),
25: array([ 1, 2, 3, 10, 13, 15, 16, 17, 18, 21, 22, 26, 34, 39, 41, 42, 45]),
26: array([ 5, 7, 12, 19, 20, 23, 27, 28, 31, 38, 44, 47, 48]),
27: array([14, 32, 43, 46])}
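The session above stores element indices per bin. If the goal is the dictionary of x values from the question, one possible variation (not part of the original answer) is to sort once by bin index and cut the sorted values where the bin changes, which avoids one np.where scan per bin. The data, seed and edges below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, for reproducibility only
x = rng.normal(size=1000)             # values to be grouped
y = rng.uniform(0, 60, size=1000)     # values that decide the bin
edges = np.array([10.0, 20.0, 50.0])  # hypothetical bin edges

assign = np.digitize(y, edges, right=True)

# Sort once by bin index (stable, so the original order is kept within a bin).
order = np.argsort(assign, kind="stable")

# In the sorted assignments, the first occurrence of each bin marks its start.
bins, starts = np.unique(assign[order], return_index=True)

# Cut the sorted x values at those start positions: one array per bin.
groups = dict(zip(bins.tolist(), np.split(x[order], starts[1:])))
```

With these edges, groups[1] would hold every x whose y satisfies 10<y<=20 and groups[2] those with 20<y<=50.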
For dealing with flattening and then unflattening numpy arrays, see:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html
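For a multidimensional y, one way these two functions might be used is to bin each coordinate separately and then collapse the per-coordinate indices into a single flat bin id. This is only a sketch: the two-column y and both sets of edges are hypothetical, and the question's binner_function may work differently.

```python
import numpy as np

# Hypothetical 2-D binning variable: each row of y is one (y0, y1) pair.
y = np.array([[12.0, 0.5],
              [30.0, 1.5],
              [15.0, 1.5],
              [45.0, 0.5]])
edges0 = np.array([10.0, 20.0, 50.0])  # edges for the first coordinate
edges1 = np.array([0.0, 1.0, 2.0])     # edges for the second coordinate

# Bin each coordinate on its own; indices run from 0 to len(edges) inclusive.
i0 = np.digitize(y[:, 0], edges0, right=True)
i1 = np.digitize(y[:, 1], edges1, right=True)

# Collapse the pair of indices into one flat bin id per entry.
shape = (len(edges0) + 1, len(edges1) + 1)
flat = np.ravel_multi_index((i0, i1), shape)

# Recover the per-coordinate indices from the flat id when needed.
back0, back1 = np.unravel_index(flat, shape)
```

The flat ids can then be grouped exactly like one-dimensional assignments, and np.unravel_index turns a dictionary key back into the tuple of per-coordinate bin indices.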