Why doesn't groupby sum convert boolean to int or float?


Problem description

I'll start with 3 simple examples:

import pandas as pd

pd.DataFrame([[True]]).sum()

0    1
dtype: int64


pd.DataFrame([True]).sum()

0    1
dtype: int64


pd.Series([True]).sum()

1


All of these are as expected. Here is a more complicated example.

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', True],
        ['b', 'B', True],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.Z.sum()

4

Also as expected. However, if I groupby(['X', 'Y']).sum(), the Z column comes back as bool. I expected it to hold numeric sums (int or float).

I'm thinking bug. Is there another explanation?
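A quick way to see what a given pandas installation returns for this case is to inspect the dtype directly. The sketch below rebuilds the frame from the question so it runs on its own, and selects Z explicitly so that only the boolean column is aggregated. The resulting dtype may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

# Same data as the question's first grouped example.
df = pd.DataFrame([
    ['a', 'A', True],
    ['a', 'B', False],
    ['a', 'C', True],
    ['b', 'A', True],
    ['b', 'B', True],
    ['b', 'C', False],
], columns=list('XYZ'))

# Aggregate only the boolean column, one row per (X, Y) pair.
grouped = df.groupby(['X', 'Y'])['Z'].sum()

# The dtype is the interesting part: bool, int, or float depending on
# how the installed pandas handles the result.
print(grouped.dtype)
print(grouped)
```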

Per @unutbu's answer

pandas is trying to recast the result to the original dtype. I had thought that maybe the groupby I'd performed didn't really group anything, so I tried this example to test the idea.

df = pd.DataFrame([
        ['a', 'A', False],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

I'll groupby('X') and sum. If @unutbu is correct, these sums should be 1 and 0, which are castable to bool, so we should see bool.

df.groupby('X').sum()

Sure enough... bool

But what if the process is the same and the values are slightly different?

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.groupby('X').sum()

Lesson learned: always use astype(int) or something similar when doing this.

df.groupby('X').sum().astype(int)

This gives consistent results in either scenario.
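As a minimal sketch of that lesson, the snippet below runs both scenarios through the same aggregation and casts the result to int. The helper names make_df and count_true are hypothetical and exist only for this illustration. Selecting Z before summing is an extra assumption made here so that the string column Y is not aggregated along with it.

```python
import pandas as pd

def make_df(z_values):
    """Build the question's X/Y/Z frame with the given boolean Z column."""
    rows = zip('aaabbb', 'ABCABC', z_values)
    return pd.DataFrame(list(rows), columns=list('XYZ'))

# Scenario 1: group 'a' has a single True; scenario 2: group 'a' has two.
one_true = make_df([False, False, True, False, False, False])
two_true = make_df([True, False, True, False, False, False])

def count_true(frame):
    """Count True values per X group, always returning an integer dtype."""
    return frame.groupby('X')['Z'].sum().astype(int)

counts_one = count_true(one_true)
counts_two = count_true(two_true)

# Both results share the same integer dtype regardless of the values.
print(counts_one.dtype, counts_two.dtype)
```

Casting after the sum is what the question suggests. Casting Z to int before grouping would be another way to avoid relying on how the result is cast back.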

Recommended answer

This occurs because _cython_agg_blocks calls _try_coerce_and_cast_result which calls _try_cast_result which tries to return a result of the same dtype as the original values (in this case, bool).

This returns something a little peculiar when Z has dtype bool (and all the groups have no more than one True value). If any of the groups have 2 or more True values, then the resulting values are floats since _try_cast_result does not convert 2.0 back to a boolean.

_try_cast_result does something more useful when Z has dtype int: Internally, the Cython aggregator used by df.groupby(['X', 'Y']).sum() returns a result of dtype float. Here then, _try_cast_result returns the result to dtype int.
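The behavior described above can be imitated with plain NumPy. This is only an illustration of the idea of casting back when the round trip is lossless, not the pandas source. The helper try_cast_back is a hypothetical name, and the actual rules inside _try_cast_result may differ.

```python
import numpy as np

def try_cast_back(result, original_dtype):
    """Cast result to original_dtype only if no information is lost.

    Hypothetical helper: mimics the described behavior, not pandas itself.
    """
    candidate = result.astype(original_dtype)
    # Round-trip check: casting back must reproduce the original values.
    if np.array_equal(candidate.astype(result.dtype), result):
        return candidate
    return result

# Float sums as an aggregator might return them.
sums_small = np.array([1.0, 0.0])  # every group has at most one True
sums_large = np.array([2.0, 0.0])  # one group has two True values

bool_small = try_cast_back(sums_small, np.dtype(bool))     # becomes bool
bool_large = try_cast_back(sums_large, np.dtype(bool))     # stays float64
int_large = try_cast_back(sums_large, np.dtype('int64'))   # becomes int64

print(bool_small.dtype, bool_large.dtype, int_large.dtype)
```

Under this sketch, 2.0 cannot survive a trip through bool, so the float result is kept, while the same sums cast cleanly to int.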
