Fast categorization (binning)

Question

I have a huge number of entries, each of which is a float. These data x are accessible through an iterator. I need to classify all the entries using selections like 10<y<=20, 20<y<=50, ..., where y are data from another iterable. The number of entries is much larger than the number of selections. In the end I want a dictionary like:

{ 0: [all events with 10<x<=20],
  1: [all events with 20<x<=50], ... }

or something similar. For example, I'm doing:

for x, y in itertools.izip(variable_values, binning_values):
    thebin = binner_function(y)
    self.data[tuple(thebin)].append(x)

In general, y is multidimensional.

This is very slow. Is there a faster solution, for example with numpy? I think the problem comes from the list.append method I'm using and not from binner_function.

Recommended answer

A fast way to get the bin assignments in numpy is np.digitize:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html
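One detail worth checking against the question: by default np.digitize treats each bin as closed on the left (bins[i-1] <= x < bins[i]), while the selections in the question are closed on the right (10<y<=20). Passing right=True switches to the second convention. A minimal sketch, where the edges and sample values are made up for illustration:

```python
import numpy as np

# Hypothetical bin edges for the selections 10<y<=20 and 20<y<=50.
edges = np.array([10.0, 20.0, 50.0])

# Made-up sample values, including points on the edges and outside the range.
y = np.array([5.0, 10.0, 15.0, 20.0, 20.5, 50.0, 60.0])

# right=True gives intervals of the form edges[i-1] < y <= edges[i].
# Subtracting 1 makes the first real bin 0, as in the dictionary the
# question asks for; -1 then marks values at or below the lowest edge.
bin_ids = np.digitize(y, edges, right=True) - 1

# Values that fall inside one of the len(edges) - 1 selections.
in_range = (bin_ids >= 0) & (bin_ids < len(edges) - 1)

print(bin_ids)   # [-1 -1  0  0  1  1  2]
print(in_range)  # [False False  True  True  True  True False]
```

Whether out-of-range values should be dropped or kept in overflow bins depends on your selections, so treat the in_range mask as one possible choice.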

You still have to split the resulting assignments into groups. If x or y is multidimensional, you will have to flatten the arrays first. You can then get the unique bin assignments and iterate over them with np.where to split the assignments into groups. This will probably be faster if the number of bins is much smaller than the number of elements that need to be binned.

Here is a somewhat trivial example that you will need to tweak and elaborate on for your particular problem (but it is hopefully enough to get you started with a numpy solution):

In [1]: import numpy as np

In [2]: x = np.random.normal(size=(50,))

In [3]: b = np.linspace(-20,20,50)

In [4]: assign = np.digitize(x,b)

In [5]: assign
Out[5]: 
array([23, 25, 25, 25, 24, 26, 24, 26, 23, 24, 25, 23, 26, 25, 27, 25, 25,
       25, 25, 26, 26, 25, 25, 26, 24, 23, 25, 26, 26, 24, 24, 26, 27, 24,
       25, 24, 23, 23, 26, 25, 24, 25, 25, 27, 26, 25, 27, 26, 26, 24])

In [6]: uid = np.unique(assign)

In [7]: adict = {}

In [8]: for ii in uid:
   ...:     adict[ii] = np.where(assign == ii)[0]
   ...:     

In [9]: adict
Out[9]: 
{23: array([ 0,  8, 11, 25, 36, 37]),
 24: array([ 4,  6,  9, 24, 29, 30, 33, 35, 40, 49]),
 25: array([ 1,  2,  3, 10, 13, 15, 16, 17, 18, 21, 22, 26, 34, 39, 41, 42, 45]),
 26: array([ 5,  7, 12, 19, 20, 23, 27, 28, 31, 38, 44, 47, 48]),
 27: array([14, 32, 43, 46])}
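The dictionary above holds indices into x, not the values themselves, and the loop calls np.where once per bin. If that loop becomes the bottleneck, one alternative (a sketch, not part of the original answer) is to sort once by bin id and cut the sorted values at the points where the id changes. The helper name group_by_bin is made up for this example:

```python
import numpy as np

def group_by_bin(values, bin_ids):
    """Return {bin_id: array of values assigned to that bin}.

    Hypothetical helper: sorts once by bin id, then splits the sorted
    values where the bin id changes, instead of scanning once per bin.
    """
    values = np.asarray(values)
    bin_ids = np.asarray(bin_ids)

    # A stable sort keeps the original order of the values inside each bin.
    order = np.argsort(bin_ids, kind="stable")
    sorted_ids = bin_ids[order]

    # starts[i] is the position of the first element of the i-th distinct bin.
    uniq, starts = np.unique(sorted_ids, return_index=True)

    # Split at every start except the first one (which is always 0).
    chunks = np.split(values[order], starts[1:])
    return {int(k): chunk for k, chunk in zip(uniq, chunks)}

# Small made-up input to show the shape of the result.
groups = group_by_bin([1.0, 2.0, 3.0, 4.0, 5.0], [2, 0, 2, 1, 0])
print(groups)
```

With the arrays from the session above, group_by_bin(x, assign) would give the x values per bin directly. Bins that receive no values simply do not appear as keys.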

For dealing with flattening and then unflattening numpy arrays, see: http://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html
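For a multidimensional y, one way to use these two functions (again a sketch under assumptions, since the question does not show binner_function) is to digitize each dimension separately and then combine the per-dimension bin numbers into a single flat bin id with np.ravel_multi_index. The function name multi_dim_bin_ids, the edges, and the sample points are all made up:

```python
import numpy as np

def multi_dim_bin_ids(y, edges_per_dim):
    """Map each row of y (shape: n_points x n_dims) to one flat bin id.

    Hypothetical helper. Each dimension d is digitized against
    edges_per_dim[d]; np.digitize returns values in 0..len(edges),
    so each dimension has len(edges) + 1 possible bin numbers.
    """
    y = np.asarray(y, dtype=float)
    per_dim = tuple(
        np.digitize(y[:, d], edges, right=True)
        for d, edges in enumerate(edges_per_dim)
    )
    shape = tuple(len(edges) + 1 for edges in edges_per_dim)
    return np.ravel_multi_index(per_dim, shape), shape

# Made-up two-dimensional example.
edges_per_dim = [np.array([10.0, 20.0, 50.0]), np.array([0.0, 1.0, 2.0])]
y = np.array([[15.0, 0.5], [25.0, 1.5], [15.0, 1.5]])

flat_ids, shape = multi_dim_bin_ids(y, edges_per_dim)
print(flat_ids)  # [ 5 10  6]

# np.unravel_index recovers the per-dimension bin numbers from a flat id.
print(np.unravel_index(flat_ids[0], shape))
```

The flat ids can then be grouped exactly like the one-dimensional assignments, and np.unravel_index turns a dictionary key back into a tuple of per-dimension bins when you need it.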
