Keep duplicate rows in multiple dataframes

Question

With the following dataframes, how do I extract each of the following into its own dataframe:

  • rows with unique Account only
  • all rows with duplicated Accounts

I have two datasets, df[0]...:

Account     Verified     Paid   Col1 Col2 Col3
1234        True        True     ...  ...  ...
1237        False       True    
1234        True        True
4211        True        True
1237        False       True
312         False       False

...and df[1]:

Account          Verified   Paid   Col1 Col2 Col3
41                True      True    ... ... ...
314               False     False
41                True      True
65                False     False

To pass through all dataframes in my list, without replacing my df[i], and extract unique rows I used the following code:

filt = []
for i in range(len(df)):  # range(0, 1) would only visit df[0]
    filt.append(df[i].groupby(list(df[i].Account)).agg('first').reset_index())

However, I would also like to pass through all dataframes in my list and, still without replacing my df, extract the rows with duplicates. In the example above, I should get one dataframe that includes accounts 1234 and 1237, and another that includes only 41.

How could I get these two datasets?

Recommended answer

You have two dataframes with some duplicates in the 'Account' column. There's no need to write the line-by-line groupby hack you wrote.

To get a dataframe with unique Accounts only (i.e. with duplicates dropped), use drop_duplicates(). By default it returns a new dataframe and leaves df[i] untouched. See its keep='first'/'last'/False option (False drops every row whose Account is repeated) and its inplace=True option.

>>> df[0].drop_duplicates('Account')    
   Account  Verified   Paid Col1 Col2 Col3
0     1234      True   True  ...  ...  ...
1     1237     False   True  NaN  NaN  NaN
3     4211      True   True  NaN  NaN  NaN
5      312     False  False  NaN  NaN  NaN

>>> df[1].drop_duplicates('Account')
   Account  Verified   Paid Col1 Col2 Col3
0       41      True   True  ...  ...  ...
1      314     False  False  NaN  NaN  NaN
3       65     False  False  NaN  NaN  NaN
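
As a rough illustration of the keep option mentioned above, here is a minimal sketch. The frame df0 is a hypothetical reconstruction of df[0] from the question with Col1 to Col3 left out, and the variable names are mine, not the asker's.

```python
import pandas as pd

# Hypothetical reconstruction of df[0] from the question (Col1..Col3 omitted).
df0 = pd.DataFrame({
    "Account":  [1234, 1237, 1234, 4211, 1237, 312],
    "Verified": [True, False, True, True, False, False],
    "Paid":     [True, True, True, True, True, False],
})

# keep='first' is the default: the first row of each Account survives.
first_kept = df0.drop_duplicates("Account")
# keep='last': the last row of each Account survives instead.
last_kept = df0.drop_duplicates("Account", keep="last")
# keep=False: every Account that occurs more than once is dropped entirely.
singletons = df0.drop_duplicates("Account", keep=False)

print(first_kept.index.tolist())       # [0, 1, 3, 5]
print(last_kept.index.tolist())        # [2, 3, 4, 5]
print(singletons["Account"].tolist())  # [4211, 312]
print(len(df0))                        # 6, the original is not modified
```

If "rows with unique Account only" was meant as "accounts that occur exactly once", the keep=False variant is probably the one you want.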

To get a dataframe with the duplicated records only, use .duplicated('Account', keep=False) as a boolean mask. keep=False marks every occurrence of a repeated Account as True, so all of the duplicated rows are kept.

>>> df[0][ df[0].duplicated('Account', keep=False) ]
   Account  Verified  Paid Col1 Col2 Col3
0     1234      True  True  ...  ...  ...
1     1237     False  True  NaN  NaN  NaN
2     1234      True  True  NaN  NaN  NaN
4     1237     False  True  NaN  NaN  NaN
>>> df[1][ df[1].duplicated('Account', keep=False) ]
   Account  Verified  Paid Col1 Col2 Col3
0       41      True  True  ...  ...  ...
2       41      True  True  NaN  NaN  NaN
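
Since the question asks for this across the whole list without overwriting any df[i], both calls can be wrapped in list comprehensions. This is only a sketch: dfs stands in for the asker's list df, the two frames are rebuilt by hand from the question's tables (Col1 to Col3 omitted), and unique_dfs and dup_dfs are names I made up.

```python
import pandas as pd

# Hypothetical stand-in for the asker's list `df` (Col1..Col3 omitted).
dfs = [
    pd.DataFrame({
        "Account":  [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid":     [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account":  [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid":     [True, False, True, False],
    }),
]

# One deduplicated frame per input frame; the inputs are left as they were.
unique_dfs = [d.drop_duplicates("Account") for d in dfs]

# One frame per input holding every row whose Account occurs more than once.
dup_dfs = [d[d.duplicated("Account", keep=False)] for d in dfs]

print(dup_dfs[0]["Account"].tolist())  # [1234, 1237, 1234, 1237]
print(dup_dfs[1]["Account"].tolist())  # [41, 41]
```

Both comprehensions build new dataframes, so dfs[0] and dfs[1] keep all of their original rows.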

You might want to sort the last two dataframes in order of 'Account':

df[0][ df[0].duplicated('Account', keep=False) ].sort_values('Account')

Note: keeping a list df[i] of multiple dataframes and iterating over it is not very idiomatic pandas. It is generally better to merge or concat the dataframes and add one extra column recording where each row came from. (It is also more efficient: groupby, apply, drop_duplicates etc. only need to run once.)
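
One possible way to do that, sketched under the same assumptions as the question's data: the column name source and the variable names are hypothetical, and Col1 to Col3 are omitted.

```python
import pandas as pd

# Hypothetical stand-in for the asker's list of dataframes (Col1..Col3 omitted).
dfs = [
    pd.DataFrame({
        "Account":  [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid":     [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account":  [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid":     [True, False, True, False],
    }),
]

# Tag each row with the position of the frame it came from, then stack them.
combined = pd.concat(
    [d.assign(source=i) for i, d in enumerate(dfs)],
    ignore_index=True,
)

# Duplicates are judged per (source, Account) pair, so an Account that
# happens to appear in two different frames is not counted as a duplicate.
key = ["source", "Account"]
uniques = combined.drop_duplicates(key)
dups = combined[combined.duplicated(key, keep=False)]

print(dups.loc[dups["source"] == 0, "Account"].tolist())  # [1234, 1237, 1234, 1237]
print(dups.loc[dups["source"] == 1, "Account"].tolist())  # [41, 41]
print(len(uniques))                                       # 7
```

If the per-frame results are still needed afterwards, filtering on the source column (or grouping by it) recovers them.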
