Assigning points to bins

Question

What is a good way to bin numerical values into a certain range? For example, suppose I have a list of values and I want to split them into N bins according to their range. Right now, I do something like this:

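import numpy as np

num_bins = 3  # number of bins to use
values = np.array([2, 4, 6, 7, 9])  # placeholder data; any array of integers
min_val = values.min() - 1
max_val = values.max() + 1
my_bins = np.linspace(min_val, max_val, num_bins)
min_index = np.argmin  # returns the index of the minimum value
# assign each point to the bin edge it is closest to
for v in values:
    best_bin = min_index(np.abs(my_bins - v))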

Here min_index returns the index of the minimum value. The idea is to find the bin a point falls into by picking the bin it differs from least.

But I think this has weird edge cases. What I am looking for is a good representation of bins, ideally half-open ones (closed on the left, open on the right, so that no point can be assigned to two bins), i.e.

bin1 = [x1, x2)
bin2 = [x2, x3)
bin3 = [x3, x4)
etc...

What is a good way to do this in Python, using numpy/scipy? I am only concerned here with binning integer values.

Thanks a lot for your help.

Recommended answer

numpy.histogram() does exactly what you want.

The function signature is:

numpy.histogram(a, bins=10, range=None, density=None, weights=None)

We're mostly interested in a and bins. a is the input data that needs to be binned. bins can be the number of bins (your num_bins), or it can be a sequence of scalars that gives the bin edges (half-open).

import numpy
values = numpy.arange(10, dtype=int)
bins = numpy.arange(-1, 11)
freq, bins = numpy.histogram(values, bins)
# freq is now [0 1 1 1 1 1 1 1 1 1 1]
# bins is unchanged
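
The example above passes explicit edges. For the other form of bins, an integer count, here is a minimal sketch. It assumes num_bins = 3 (borrowed from the question) and an explicit range of (0, 9); both are illustrative choices, not part of the original answer.

```python
import numpy as np

values = np.arange(10)   # the integers 0..9, as in the example above
num_bins = 3             # assumed bin count, taken from the question

# With an integer `bins`, NumPy builds equal-width edges spanning `range`
# (or the data's min and max when `range` is omitted).
freq, edges = np.histogram(values, bins=num_bins, range=(0, 9))

print(edges)  # num_bins + 1 edges delimit num_bins bins
print(freq)   # one count per bin; the counts sum to len(values)
```

Note that num_bins bins always need num_bins + 1 edges, which is the detail the linspace call in the question gets wrong.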

Quoting the documentation:

All but the last (righthand-most) bin is half-open. In other words, if bins is:

[1, 2, 3, 4]

then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
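
The quoted rule can be checked directly. This is a small sketch using the edges from the documentation excerpt; the sample values, including the out-of-range 5, are made up for illustration.

```python
import numpy as np

edges = [1, 2, 3, 4]
values = np.array([1, 2, 3, 4, 5])  # illustrative data; 5 lies outside the edges

freq, _ = np.histogram(values, bins=edges)

# 1 -> [1, 2), 2 -> [2, 3), and both 3 and 4 -> [3, 4] (last bin is closed).
# The value 5 is beyond the last edge, so histogram does not count it.
print(freq)
```

So a value equal to the right-most edge is still counted, while anything outside the edges is silently dropped from the counts.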

Edit: You want to know, for each element, the index of the bin it falls into. For this, you can use numpy.digitize(). If your bins are going to be integral, you can use numpy.bincount() as well.

>>> values = numpy.random.randint(0, 20, 10)
>>> values
array([17, 14,  9,  7,  6,  9, 19,  4,  2, 19])
>>> bins = numpy.linspace(-1, 21, 23)
>>> bins
array([ -1.,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,
        10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,
        21.])
>>> pos = numpy.digitize(values, bins)
>>> pos
array([19, 16, 11,  9,  8, 11, 21,  6,  4, 21])

Since each interval is open at its upper limit, the indices are correct:

>>> (bins[pos-1] == values).all()
True
>>> for n in range(len(values)):
...     print("%g <= %g < %g" % (bins[pos[n]-1], values[n], bins[pos[n]]))
...
17 <= 17 < 18
14 <= 14 < 15
9 <= 9 < 10
7 <= 7 < 8
6 <= 6 < 7
9 <= 9 < 10
19 <= 19 < 20
4 <= 4 < 5
2 <= 2 < 3
19 <= 19 < 20
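
Tying this back to the question, here is one possible sketch for assigning integers to N equal-width half-open bins. The bin count of 3 and the choice to extend the top edge by 1 are assumptions made for illustration; the values are the ones from the session above.

```python
import numpy as np

values = np.array([17, 14, 9, 7, 6, 9, 19, 4, 2, 19])  # sample from the session above
num_bins = 3  # assumed bin count, taken from the question

# num_bins bins need num_bins + 1 edges. The top edge is max + 1 so that the
# largest integer still falls inside the last half-open bin [x3, x4).
edges = np.linspace(values.min(), values.max() + 1, num_bins + 1)

# digitize returns i with edges[i-1] <= v < edges[i]; subtract 1 for 0-based bins.
bin_index = np.digitize(values, edges) - 1

# bincount then tallies how many values landed in each bin.
counts = np.bincount(bin_index, minlength=num_bins)

print(edges)
print(bin_index)
print(counts)
```

Because every interval is half-open, a value that sits exactly on an interior edge (14 here) goes to the bin on its right, so no point can belong to two bins.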
