Pandas - Conditional drop duplicates
Question
I have a DataFrame (Pandas 0.19.2, Python 3.6.x) as shown below. I want to drop_duplicates() on rows with the same Id, based on conditional logic.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame({'Id': [1, 2, 3, 4, 3, 2, 6, 7, 1, 8],
                   'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
                   'Size': np.random.rand(10),
                   'Age': [19, 25, 22, 31, 43, 23, 44, 20, 51, 31]})
What would be the most efficient (if possible, vectorised) way to achieve this, based on the logic I describe below?
1) Before dropping duplicates, sum the Size of duplicate Id entries.
2) Drop duplicates for records with the same Id, keeping the one that has the larger Age.
The desired output would be:
Age Id Name Size
1 25 2 B 0.812662
3 31 4 D 0.302333
4 43 3 E 0.146870
6 44 6 G 0.186260
7 20 7 H 0.345561
8 51 1 I 0.813790
9 31 8 K 0.538817
Recommended answer
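Use GroupBy.transform to get aggregated values with the same size as the original DataFrame, then sort_values and drop_duplicates to remove the duplicates. Sorting by Age in ascending order and passing keep='last' retains the row with the largest Age for each Id, and sort_index restores the original row order: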
df['Size'] = df.groupby('Id')['Size'].transform('sum')
df = df.sort_values('Age').drop_duplicates('Id', keep='last').sort_index()
print (df)
Id Name Size Age
1 2 B 0.812663 25
3 4 D 0.302333 31
4 3 E 0.146870 43
6 6 G 0.186260 44
7 7 H 0.345561 20
8 1 I 0.813789 51
9 8 K 0.538817 31
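As an alternative that avoids sorting, the same two steps can be sketched with GroupBy.idxmax, which returns the index label of the row holding the largest Age in each group. This is a variation for illustration, not the answer's method; the names keep and result are introduced here. When two rows of one Id share the same maximum Age, idxmax picks the first occurrence, so tie handling may differ from the sort-based version.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'Id': [1, 2, 3, 4, 3, 2, 6, 7, 1, 8],
                   'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
                   'Size': np.random.rand(10),
                   'Age': [19, 25, 22, 31, 43, 23, 44, 20, 51, 31]})

# Step 1: broadcast the per-Id sum of Size back onto every row.
df['Size'] = df.groupby('Id')['Size'].transform('sum')

# Step 2: for each Id, find the index label of the row with the largest Age
# (hypothetical variable names; ties resolve to the first occurrence).
keep = df.groupby('Id')['Age'].idxmax()

# Select those rows and restore the original row order.
result = df.loc[keep].sort_index()
print(result)
```

The selected rows should match those in the output above (index labels 1, 3, 4, 6, 7, 8, 9).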