Dask Filter Dataframe on Multi-Column Groupby Size
Question
Goal = multi-column groupby on a dask dataframe, filtering out groups that contain fewer than 3 rows.
Based on this post:
Filtering a grouped df in Dask
I'm able to calculate the size of each groupby object, but I cannot figure out how to map it back to my dataframe from the multi-column groupby. I tried many variations of the following, to no avail:
a = input_df.groupby(["FeatureID", "region"])["Target"].size()
s = input_df[["FeatureID", "region"]].map(a)  # fails: a two-column selection has no .map lookup against the MultiIndex
It works great for a single-column groupby.
Thanks to @jezrael, I was able to come up with the following solution:
a = input_df.groupby(["FeatureID", "region"])["Target"].nunique().to_frame("feature_div")
input_df = input_df.join(a, on=["FeatureID", "region"])
# filter out features below diversity threshold
diversified = input_df[input_df.feature_div >= diversity_threshold]
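As a side note, in plain pandas the per-group sizes can be attached without a join at all via groupby(...).transform("size"); a minimal sketch using the question's column names and a threshold of 2 (assumed here so the toy data below keeps some rows — dask's groupby transform support is more limited than pandas'):

```python
import pandas as pd

# toy data matching the question's example
df = pd.DataFrame({
    "FeatureID": [4, 5, 4, 5, 5, 4],
    "region": list("aaabbb"),
    "Target": [7, 8, 9, 4, 2, 3],
})

diversity_threshold = 2  # assumed value for illustration

# transform("size") broadcasts each group's row count back onto
# every row of that group, aligned with df's index
sizes = df.groupby(["FeatureID", "region"])["Target"].transform("size")
diversified = df[sizes >= diversity_threshold]
print(diversified)
```

This keeps only rows whose (FeatureID, region) group has at least `diversity_threshold` rows, with no intermediate frame to join.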
Answer
You need join with to_frame:
a = input_df.groupby(["FeatureID", "region"])["Target"].size().to_frame('New')
input_df = input_df.join(a, on=["FeatureID", "region"])
Example:
import pandas as pd
from dask import dataframe as dd
input_df = pd.DataFrame({
'FeatureID':[4,5,4,5,5,4],
'region':list('aaabbb'),
'Target':[7,8,9,4,2,3],
})
print (input_df)
   FeatureID region  Target
0          4      a       7
1          5      a       8
2          4      a       9
3          5      b       4
4          5      b       2
5          4      b       3
sd = dd.from_pandas(input_df, npartitions=3)
print (sd)
              FeatureID  region  Target
npartitions=3
0                 int64  object   int64
2                   ...     ...     ...
4                   ...     ...     ...
5                   ...     ...     ...
Dask Name: from_pandas, 3 tasks
a = sd.groupby(["FeatureID", "region"])["Target"].size().to_frame('New')
out = sd.join(a, on=["FeatureID", "region"]).compute()
print (out)
   FeatureID region  Target  New
0          4      a       7    2
1          5      a       8    1
2          4      a       9    2
3          5      b       4    2
4          5      b       2    2
5          4      b       3    1
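From there the original goal is one boolean filter on the New column. A self-contained sketch in plain pandas (the frame below reproduces the `out` printed above; the threshold is 2 here rather than the question's 3, since no group in the toy data has 3 rows):

```python
import pandas as pd

# the joined result `out` from the example above, rebuilt directly
out = pd.DataFrame({
    "FeatureID": [4, 5, 4, 5, 5, 4],
    "region": list("aaabbb"),
    "Target": [7, 8, 9, 4, 2, 3],
    "New": [2, 1, 2, 2, 2, 1],
})

threshold = 2  # assumed; the question uses 3
# keep only rows whose group size meets the threshold
filtered = out[out["New"] >= threshold]
print(filtered)
```

On a dask dataframe the same boolean-mask filter works lazily; call .compute() at the end to materialize the result.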