Pandas - Groupby object behavior


Question

I am currently struggling with the behavior of the pandas groupby object. I have just switched from version 0.2.5 to 1.2.3, and my code no longer behaves the same way.

In version 0.2.5, when I grouped by multiple columns, all rows where the result was 0 were simply dropped. In the version I am using now, every combination of the unique values from each column is grouped, so many rows show a result of 0.

Code example:

df.groupby(['ColumnA', 'ColumnB'])['ColumnC'].count()

Result in 0.2.5:

ColumnA | ColumnB | Result of Count

Result in 1.2.3:
Column A - Value 1 | Column B - Value 1 | 2
Column A - Value 1 | Column B - Value 2 | 0
Column A - Value 2 | Column B - Value 1 | 0
Column A - Value 2 | Column B - Value 2 | 0

This creates a lot of unnecessary rows that are basically useless. It becomes especially annoying when you work with large datasets of millions of rows and thousands of unique values per column. How can I force the behaviour of my previous version? Otherwise I would have to redo a lot of functions that I have created. What did I miss in the transition between versions?

Solution

It seems you are working with Categoricals, so you need the parameter observed=True in DataFrame.groupby to avoid adding the missing categories:

observed, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

df.groupby(['ColumnA', 'ColumnB'], observed=True)['ColumnC'].count()
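
To see the difference in isolation, here is a minimal sketch. It assumes both grouping columns are categorical, which the question does not confirm. The column names follow the question, but the values are invented for illustration. `observed` is passed explicitly in both calls because its default has not stayed the same across pandas releases.

```python
import pandas as pd

# Hypothetical data: each categorical column declares two categories,
# but only one of them actually occurs in the rows.
df = pd.DataFrame(
    {
        "ColumnA": pd.Categorical(["a1", "a1"], categories=["a1", "a2"]),
        "ColumnB": pd.Categorical(["b1", "b1"], categories=["b1", "b2"]),
        "ColumnC": [10, 20],
    }
)

# observed=False: one row per combination of declared categories,
# including combinations that never occur in the data.
all_combos = df.groupby(["ColumnA", "ColumnB"], observed=False)["ColumnC"].count()

# observed=True: only combinations that actually occur in the data.
observed_only = df.groupby(["ColumnA", "ColumnB"], observed=True)["ColumnC"].count()

print(all_combos)
print(observed_only)
```

If the grouping columns are not categorical, observed has no effect, so it is worth checking df.dtypes first.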
