Pandas Dataframe selecting groups with minimal cardinality


Question

I have a problem where I need to take groups of rows from a data frame where the number of items in a group exceeds a certain number (the cutoff). For those groups, I need to take some head rows and the tail row.

I am using the code below:

train = train[train.groupby('id').id.transform(len) > headRows]
groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()

This works, but the first line is very slow: 30 minutes or more.

Is there any way to make the first line faster? If I do not use the first line, the result of the second line contains duplicate indices, which messes things up.

Thanks in advance. Regards

Note: My train data frame has around 70,000 groups of varying size over around 700,000 rows. It follows from my other question, "Data processing with adding columns dynamically in Python Pandas Dataframe". Jeff gave a great answer there, but it fails if the group size is less than or equal to the parameter I pass to head(parameter) when concatenating my rows, as in Jeff's answer: In [31]: groups = concat.....
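The duplicate-index problem described above can be reproduced on a small frame. The following is a minimal sketch, not the asker's data: the column `value`, the toy rows, and the name `head_rows` are assumptions made for illustration. It shows that when a group has no more rows than the argument passed to head(), the group's last row is picked up by both head() and tail(1).

```python
import pandas as pd

# Hypothetical toy data: group 'a' has 2 rows, group 'b' has 5 rows.
train = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b", "b", "b"],
    "value": range(7),
})
head_rows = 3  # stands in for headRows in the question

# Without filtering by group size first, head() and tail() overlap
# for every group whose size is <= head_rows.
groups = pd.concat([
    train.groupby("id").head(head_rows),
    train.groupby("id").tail(1),
]).sort_index()

# Collect index labels that occur more than once in the result.
duplicated_labels = groups.index[groups.index.duplicated()].tolist()
print(duplicated_labels)
```

Here the last row of group 'a' is selected twice, which is why the size filter has to run before the concatenation.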

Recommended answer

Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)

This returns only the rows of your dataframe whose group size is greater than your cutoff. It should also perform quite a bit better. I timed filter here on a dataframe with 30,039 'id' groups and a little over 4 million observations:

In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
1 loops, best of 3: 12.6 s per loop
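As a small end-to-end illustration, the sketch below applies the filter call from the answer to toy data and then takes the head rows and the tail row. The toy frame, the `value` column, and the variable names are assumptions, not part of the original posts. It also shows an alternative that is not in the answer: building a boolean mask with transform("size"), which avoids calling a Python lambda once per group and so may be worth timing on large data.

```python
import pandas as pd

# Hypothetical toy data: groups 'a' (2 rows), 'b' (5 rows), 'c' (1 row).
df = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b", "b", "b", "c"],
    "value": range(8),
})
cutoff = 3

# Approach from the answer: keep only groups larger than the cutoff.
kept_filter = df.groupby("id").filter(lambda x: len(x) > cutoff)

# Alternative (an assumption, not from the answer): a per-row group size
# computed by the built-in "size" aggregation, used as a boolean mask.
sizes = df.groupby("id")["id"].transform("size")
kept_mask = df[sizes > cutoff]

# Both approaches select the same rows.
assert kept_filter.equals(kept_mask)

# Every remaining group has more than `cutoff` rows, so head(cutoff)
# and tail(1) cannot overlap and the index stays unique.
groups = pd.concat([
    kept_mask.groupby("id").head(cutoff),
    kept_mask.groupby("id").tail(1),
]).sort_index()
print(groups)
```

Whether the mask variant is faster than filter depends on the pandas version and the data, so it is best measured with %timeit on the real frame.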
