How to subset a data frame using Pandas based on a group criterion?


Problem description

I have a large data set with the following structure:

User     X
1        0
1        0
2        0
2        0
2        1
3        0
3        0

I would like to take a subset of the data such that the sum of column X for each User is 0. Given the above example, the subset should include only the observations for users 1 and 3, as follows:

User     X
1        0
1        0
3        0
3        0

Is there a way to do this using the groupby function without collapsing the data to one row per group? I want the subset to include the individual observations.

Recommended answer

DSM's answer, which selects rows using a boolean mask, works well even if the DataFrame has a non-unique index. My method, which selects rows using index values, is slightly slower when the index is unique and significantly slower when the index contains duplicate values.

@roland: Please consider accepting DSM's answer instead.

You could use a groupby-filter:

In [16]: df.loc[df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
Out[16]: 
   User  X
0     1  0
1     1  0
5     3  0
6     3  0


By itself, the groupby-filter just returns this:

In [29]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0)
Out[29]: 
0    0
1    0
5    0
6    0
Name: X, dtype: int64

but you can then use its index,

In [30]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index
Out[30]: Int64Index([0, 1, 5, 6], dtype='int64')

to select the desired rows with df.loc.
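For anyone who wants to reproduce the steps above, here is a minimal self-contained sketch. It assumes the sample data from the question with a default integer index; the variable names `kept`, `subset`, and `direct` are only illustrative. The last line shows an alternative that is not part of the original answer: calling `filter` on the grouped DataFrame itself, which hands back the full rows without the extra `.loc` step.

```python
import pandas as pd

# Sample data from the question (default index 0..6).
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Step 1: keep only the X values that belong to groups whose sum is 0.
kept = df.groupby("User")["X"].filter(lambda x: x.sum() == 0)

# Step 2: use the surviving index labels to pull the full rows.
subset = df.loc[kept.index]

# Alternative (an assumption, not from the original answer): filter the
# grouped DataFrame directly so the full rows come back in one call.
direct = df.groupby("User").filter(lambda g: g["X"].sum() == 0)

print(subset)
```

Both `subset` and `direct` keep the original row labels, so they can be compared against the output shown earlier.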

Here is the benchmark I used:

In [49]: df2 = pd.concat([df]*10000)   # df2 has a non-unique index

I Ctrl-C'd this one because it was taking too long to finish:

In [50]: %timeit df2.loc[df2.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]

When I realized my mistake, I made a DataFrame with a unique index:

In [51]: df3 = df2.reset_index()     # this gives df3 a unique index

In [52]: %timeit df3.loc[df3.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
100 loops, best of 3: 13 ms per loop

In [53]: %timeit df3.loc[df3.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.4 ms per loop

This shows DSM's method performs well even with a non-unique index:

In [54]: %timeit df2.loc[df2.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.2 ms per loop
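To make the difference between the two approaches concrete, here is a small sketch of the boolean-mask method on a frame with a non-unique index. It is an illustration, not a benchmark: the frame is repeated only twice, and the names `dup`, `mask`, `by_mask`, and `by_label` are hypothetical. It also hints at why label-based selection struggles here: with duplicated labels, `.loc` returns every matching row for every requested label, so rows come back more than once.

```python
import pandas as pd

df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Repeat the frame so each index label 0..6 appears twice.
dup = pd.concat([df] * 2)

# transform("sum") broadcasts each group's sum back onto every row,
# so the comparison yields one True/False per row, in row order.
mask = dup.groupby("User")["X"].transform("sum") == 0
by_mask = dup.loc[mask]

# Label-based selection on the same frame: every requested label matches
# two rows, so the result contains repeated rows.
labels = dup.groupby("User")["X"].filter(lambda x: x.sum() == 0).index
by_label = dup.loc[labels]

print(len(by_mask), len(by_label))
```

The mask selects rows by position in the frame rather than by label, so duplicated labels do not affect it.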
