When using cut in a pandas dataframe to bin it, why is the binning not properly done?


Problem description

I have a dataframe that I want to bin (i.e., group into sub-ranges) by one column, and then take the mean of the second column for each bin:

import pandas as pd
import numpy as np

data = pd.DataFrame(columns=['Score', 'Age'])
data.Score = [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1]
data.Age = [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49, 43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31]

_, bins = np.histogram(data.Age, 10)
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '{}-{}'.format(bins[0], bins[1])
binned = pd.cut(data.Age, bins=bins, labels=labels, include_lowest=True, precision=0)
df = data.groupby(binned)['Score'].mean().reset_index()
df

There are two issues with this binning:

  1. There is a gap of 1 between the upper bound of the (n-1)th bin and the lower bound of the nth bin, which suggests the binning is not continuous and that data points lying in this gap are skipped.
  2. The last few bin limits have many digits after the decimal point. I passed precision=0 to cut, but it seems to have no effect: whatever x I use in precision=x, the last few bins still show many digits after the decimal point.

The second point causes problems when, for instance, I try to plot df, because it ruins the look of the x-axis:

import matplotlib.pyplot as plt
plt.plot([str(i) for i in df.Age], df.Score, 'o-')

Why does this happen in spite of the precision=0 flag, which I set to indicate that I want only integers, not floats, as the bin limits? And how do I fix it?

I'm temporarily working around this by converting the bin values to ints manually:

_, bins = np.histogram(data.Age, 10)
for i in range(len(bins)): # my fix
    bins[i] = int(bins[i])
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '{}-{}'.format(bins[0], bins[1])
binned = pd.cut(data.Age, bins=bins, labels=labels, include_lowest=True, precision=0)
df = data.groupby(binned)['Score'].mean().reset_index()
df

But this feels like a hack, and I think there should be a "proper" solution instead. And although it fixes the second issue, I'm not sure whether it fixes the first.

Recommended answer

Both of the issues you mention in your question come from one line in your code:

labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]

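The gap comes from the i + 1 in that line. It only changes the label text; the bin edges passed to cut stay contiguous, so no data points are actually skipped. The long decimals come from the same line, because the float bin edges (with their floating-point approximation error) are formatted without a fixed number of decimal places. precision=0 does not help here, since it applies only to the labels that cut generates itself, not to labels you pass in.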
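The point about labels versus bin edges can be checked with a small sketch. The `ages` values below are made up for illustration and are not the data from the question; the idea is only to compare cut with custom labels against cut with its own generated labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
ages = pd.Series([29, 35, 44, 52, 67])
_, bins = np.histogram(ages, 4)

# Custom labels: cut uses these strings as given, so `precision` cannot change them.
custom_labels = ['{}-{}'.format(lo, hi) for lo, hi in zip(bins[:-1], bins[1:])]
with_labels = pd.cut(ages, bins=bins, labels=custom_labels,
                     include_lowest=True, precision=0)

# No labels: cut builds interval categories itself and applies `precision` to them.
auto = pd.cut(ages, bins=bins, include_lowest=True, precision=0)

print(list(with_labels.cat.categories))
print(list(auto.cat.categories))

# Each value lands in the same bin either way; only the displayed labels differ.
same_bins = bool((with_labels.cat.codes == auto.cat.codes).all())
print(same_bins)
```

If this behaves as expected, the label text is purely cosmetic, which is why editing the label format is enough to address both symptoms.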

So change it to

labels = [f'{i:.1f}-{j:.1f}' for i, j in zip(bins[:-1], bins[1:])]

This rounds each bin edge in the label to one decimal place, and the line labels[0] = '{}-{}'.format(bins[0], bins[1]) is no longer needed.
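Putting the suggested change together with the data from the question gives something like the following sketch. The `observed=False` argument is an addition not present in the original code; it is assumed here so that every bin appears in the result regardless of the pandas version's default.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Score': [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1,
              2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1],
    'Age': [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49,
            43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31],
})

# Ten equal-width bin edges spanning the observed age range.
_, bins = np.histogram(data['Age'], 10)

# Format both edges of each bin to one decimal place; no "+ 1" offset,
# so each label's lower bound matches the previous label's upper bound.
labels = [f'{lo:.1f}-{hi:.1f}' for lo, hi in zip(bins[:-1], bins[1:])]

binned = pd.cut(data['Age'], bins=bins, labels=labels, include_lowest=True)

# observed=False (an assumption, see above) keeps bins that happen to be empty.
df = data.groupby(binned, observed=False)['Score'].mean().reset_index()
print(df)
```

If integer-looking labels are preferred, the format spec could be changed to `.0f`, at the cost of labels that no longer show the exact edges.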
