Binning and then combining bins with minimum number of observations?


Question

Let's say I create some data and then create bins of different sizes:

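from __future__ import division
import numpy as np
import pandas as pd

x = np.random.rand(1, 20)
# digitize returns shape (1, 20); the trailing comma unpacks the single row
new, = np.digitize(x, np.arange(1, x.shape[1] + 1) / 100)
new_series = pd.Series(new)
print(new_series.value_counts())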

Output:

20 17
16 1
4  1
2  1
dtype: int64

If I set a minimum threshold of at least 2 observations per bin, I basically want to transform the underlying data so that new_series.value_counts() looks like this:

20 17
16 3
dtype: int64

Recommended answer

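import numpy as np

x = np.random.rand(1, 100)
bins = np.arange(1, x.shape[1] + 1) / 100

new = np.digitize(x, bins)
n = new.copy()[0]  # this will hold the result

threshold = 2

# np.unique returns the occupied bin indices in ascending order
for i in np.unique(n):
    # a bin holding `threshold` or fewer observations is emptied into the next bin
    if np.sum(n == i) <= threshold:
        n[n == i] += 1

n = n.clip(0, bins.size)  # clip returns a new array; avoid adding beyond the last bin
n = n.reshape(1, -1)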

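Because np.unique returns the bin indices in ascending order, this can move counts up multiple times, until a bin is filled sufficiently. The cascade only passes through bins that were occupied to begin with, since np.unique(n) is evaluated once, before the loop starts.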
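As a variation on the loop above (not the answer's own code), the sketch below moves an under-filled bin into the next occupied bin rather than into index i + 1, so the merge also cascades across empty gaps. The function name merge_small_bins and the hard-coded labels are hypothetical; the labels simply mirror the counts shown in the question, and the `<=` comparison is carried over from the answer.

```python
import numpy as np

def merge_small_bins(labels, threshold):
    """Hypothetical helper: move under-filled bins into the next occupied bin."""
    labels = np.asarray(labels).copy()
    occupied = np.unique(labels)  # sorted bin indices that hold data
    # Walk pairs of neighbouring occupied bins, lowest first, so that
    # observations carried upward are counted when the next bin is checked.
    for current, nxt in zip(occupied[:-1], occupied[1:]):
        mask = labels == current
        if mask.sum() <= threshold:
            labels[mask] = nxt
    return labels

# Labels mirroring the question's counts: 17 in bin 20, one each in 16, 4 and 2.
labels = np.array([20] * 17 + [16, 4, 2])
merged = merge_small_bins(labels, threshold=2)

values, counts = np.unique(merged, return_counts=True)
result = dict(zip(values.tolist(), counts.tolist()))
print(result)
```

With these inputs the script prints {16: 3, 20: 17}, which matches the counts the question asks for. The highest occupied bin is never moved, so it can still end up under the threshold.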

Instead of np.digitize, it might be simpler to use np.histogram, because it directly gives you the counts, so we don't need to sum ourselves.
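A minimal sketch of the np.histogram route, under some assumptions that are not in the answer: ten equal-width bins on [0, 1], a seeded generator for reproducibility, and merging done on the counts array itself rather than on per-observation labels.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed is arbitrary, for reproducibility only
x = rng.random(100)
edges = np.linspace(0.0, 1.0, 11)   # assumed edges: 10 equal-width bins

# np.histogram returns the per-bin counts together with the edges it used.
counts, edges = np.histogram(x, bins=edges)

threshold = 2
merged = counts.copy()
# Carry the contents of each under-filled bin into its right-hand neighbour.
for i in range(merged.size - 1):
    if merged[i] <= threshold:
        merged[i + 1] += merged[i]
        merged[i] = 0

print(counts)
print(merged)
```

Working on counts keeps the total number of observations unchanged and avoids the repeated n == i comparisons, but it no longer tells you which bin each individual observation ended up in.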
