Pandas one liner to filter rows by nunique count on a specific column

Views: 72 Published: 2020/5/24 0:29:31 Tags: python, pandas


Question

In pandas, I regularly use the following to filter a dataframe by number of occurrences:

df = df.groupby('A').filter(lambda x: len(x) >= THRESHOLD)

Assume df has another column 'B', and this time I want to filter the dataframe by the count of unique values in that column. I would expect something like:

df = df.groupby('A').filter(lambda x: len(np.unique(x['B'])) >= THRESHOLD2)

But that doesn't seem to work. What would be the right approach?

Recommended answer

If you need a faster solution, use transform and filter by boolean indexing:

df2 = df[df.groupby('A')['B'].transform('nunique') >= THRESHOLD2]
print (df2)
   A  B  E
0  a  a  5
1  a  b  3
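The transform approach above can be tried end to end with a small frame. This is a minimal sketch: the sample data and the threshold value are hypothetical (the original post does not show its input), chosen only so that exactly one group passes the filter.

```python
import pandas as pd

# Hypothetical sample data (not taken from the original post)
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c"],
    "B": ["a", "b", "c", "c", "d"],
    "E": [5, 3, 6, 9, 2],
})
THRESHOLD2 = 2  # assumed value for illustration

# Number of distinct B values per A group, broadcast back to every row
counts = df.groupby("A")["B"].transform("nunique")

# Boolean indexing: keep rows whose group has at least THRESHOLD2 distinct B values
df2 = df[counts >= THRESHOLD2]
print(df2)
```

Because transform returns a Series aligned to the original index, the comparison yields a row-level boolean mask, so no Python function is called once per group.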

Timings:

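import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcde')
# B is heavily skewed toward 'a' and 'e'; A holds up to 10000 distinct group keys
df = pd.DataFrame({'B': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','B']).reset_index(drop=True)
print(df)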

THRESHOLD2 = 3

In [403]: %timeit df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
1 loop, best of 3: 3.05 s per loop

In [404]: %timeit df[df.groupby('A')['B'].transform('nunique')>= THRESHOLD2]
1 loop, best of 3: 558 ms per loop
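Before comparing speed, it may be worth confirming that the two expressions timed above select the same rows. The following is a small sketch of such a check. The data sizes, the seed, and the names via_filter and via_transform are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
N = 2_000
df = pd.DataFrame({
    "A": rng.integers(0, 500, size=N),        # about 4 rows per group on average
    "B": rng.choice(list("abcde"), size=N),   # uniform over five labels
})
THRESHOLD2 = 3

# Approach 1: call a Python function once per group
via_filter = df.groupby("A").filter(lambda g: g["B"].nunique() >= THRESHOLD2)

# Approach 2: broadcast the per-group nunique to rows, then boolean-index
via_transform = df[df.groupby("A")["B"].transform("nunique") >= THRESHOLD2]

# Both should keep the same rows, in the same order, with the same index
print(via_filter.equals(via_transform))
```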

Caveat

These timings do not account for the number of groups, which can affect the timings considerably for some of these solutions.
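One way to see how the number of groups matters is to time both approaches on synthetic frames that differ only in group count. The helper below, time_both, is a hypothetical name introduced for this sketch, and the sizes are kept small so that it runs quickly. Actual numbers will vary with hardware and pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd


def time_both(n_rows, n_groups, threshold=3, seed=0):
    """Return (filter_seconds, transform_seconds) for one synthetic frame."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "A": rng.integers(0, n_groups, size=n_rows),
        "B": rng.choice(list("abcde"), size=n_rows),
    })

    # groupby.filter: the lambda runs once per group
    start = time.perf_counter()
    df.groupby("A").filter(lambda g: g["B"].nunique() >= threshold)
    filter_s = time.perf_counter() - start

    # transform + boolean indexing: a single built-in aggregation
    start = time.perf_counter()
    df[df.groupby("A")["B"].transform("nunique") >= threshold]
    transform_s = time.perf_counter() - start

    return filter_s, transform_s


# Same row count, different group counts
for n_groups in (10, 1_000):
    f, t = time_both(n_rows=20_000, n_groups=n_groups)
    print(f"{n_groups:>6} groups: filter {f:.4f}s, transform {t:.4f}s")
```

A single perf_counter measurement is noisy. For real comparisons, %timeit or the timeit module with several repetitions would be a better choice.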
