Keep duplicate rows in multiple dataframes

Views: 112 · Published: 2020/8/1 20:05:10 · Tags: python pandas duplicates


Problem description

With the following dataframes, how do I extract each of the following into its own dataframe:

rows with unique Account only
all rows with duplicated Accounts

I have two datasets, df[0]...:

Account     Verified     Paid   Col1 Col2 Col3
1234        True        True     ...  ...  ...
1237        False       True    
1234        True        True
4211        True        True
1237        False       True
312         False       False

...and df[1]:

Account          Verified   Paid   Col1 Col2 Col3
41                True      True    ... ... ...
314               False     False
41                True      True
65                False     False

To pass through all dataframes in my list, without replacing my df[i], and extract the unique rows, I used the following code:

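filt = []
for i in range(len(df)):  # visit every dataframe in the list (range(0,1) only reaches df[0])
    # keep the first row seen for each Account
    filt.append(df[i].groupby('Account').agg('first').reset_index())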

However, I would also like to pass through all dataframes in my list and, still without replacing my df, extract the rows that have duplicates. In the example above, I should get one dataframe that contains accounts 1234 and 1237, and another that contains only 41.

How could I get these two datasets?

Recommended answer

You have two dataframes with some duplicates in the 'Account' column. There is no need for the line-by-line groupby hack you wrote.

To get a dataframe with unique Accounts only, i.e. with duplicates dropped, use drop_duplicates(). See its keep='first'/'last'/False option (False drops every row whose Account is repeated) and its inplace=True option; by default it returns a new dataframe, so df[i] is left untouched.

>>> df[0].drop_duplicates('Account')
   Account  Verified   Paid Col1 Col2 Col3
0     1234      True   True  ...  ...  ...
1     1237     False   True  NaN  NaN  NaN
3     4211      True   True  NaN  NaN  NaN
5      312     False  False  NaN  NaN  NaN
>>> df[1].drop_duplicates('Account')
   Account  Verified   Paid Col1 Col2 Col3
0       41      True   True  ...  ...  ...
1      314     False  False  NaN  NaN  NaN
3       65     False  False  NaN  NaN  NaN
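Applied to the whole list, this could look like the sketch below. The sample frames are rebuilt from the question with only the Account, Verified and Paid columns (Col1 to Col3 are elided in the original, so they are left out here), and the names dfs, first_per_account and never_repeated are made up for illustration. It also contrasts the default keep='first' with keep=False, since "unique Account only" can be read either way.

```python
import pandas as pd

# Sample data rebuilt from the question; Col1..Col3 are omitted (elided in the original).
dfs = [
    pd.DataFrame({
        "Account": [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid": [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account": [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid": [True, False, True, False],
    }),
]

# keep='first' (the default): one row per Account, the first occurrence wins.
first_per_account = [d.drop_duplicates("Account") for d in dfs]

# keep=False: drop every Account that occurs more than once.
never_repeated = [d.drop_duplicates("Account", keep=False) for d in dfs]

print(first_per_account[0]["Account"].tolist())  # [1234, 1237, 4211, 312]
print(never_repeated[0]["Account"].tolist())     # [4211, 312]
```

Because drop_duplicates returns new objects here, the original frames in dfs keep all of their rows.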

To get a dataframe with the duplicated records only, filter with .duplicated('Account', keep=False). Here keep=False marks every occurrence of a repeated Account as a duplicate, i.e. 'keep all duplicates'.

>>> df[0][ df[0].duplicated('Account', keep=False) ]
   Account  Verified  Paid Col1 Col2 Col3
0     1234      True  True  ...  ...  ...
1     1237     False  True  NaN  NaN  NaN
2     1234      True  True  NaN  NaN  NaN
4     1237     False  True  NaN  NaN  NaN
>>> df[1][ df[1].duplicated('Account', keep=False) ]
   Account  Verified  Paid Col1 Col2 Col3
0       41      True  True  ...  ...  ...
2       41      True  True  NaN  NaN  NaN

You might want to sort the last two dataframes by 'Account':

df[0][ df[0].duplicated('Account', keep=False) ].sort_values('Account')
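To do the same for every dataframe in the list without modifying any of them, a list comprehension is one option. This is a minimal sketch on sample data rebuilt from the question (Col1 to Col3 omitted because they are elided there); the names dfs and dupes are hypothetical.

```python
import pandas as pd

# Sample data rebuilt from the question; Col1..Col3 are omitted (elided in the original).
dfs = [
    pd.DataFrame({
        "Account": [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid": [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account": [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid": [True, False, True, False],
    }),
]

# For each frame: keep every row whose Account occurs more than once, sorted by Account.
# Boolean indexing and sort_values both return new frames, so dfs is not modified.
dupes = [
    d[d.duplicated("Account", keep=False)].sort_values("Account")
    for d in dfs
]

print(dupes[0]["Account"].tolist())  # [1234, 1234, 1237, 1237]
print(dupes[1]["Account"].tolist())  # [41, 41]
```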

Note: keeping a list df[i] of multiple dataframes and iterating over it is not very idiomatic pandas. It is generally better to concat (or merge) the dataframes and add one extra column that records where each row came from. This is also more efficient, since groupby, apply, drop_duplicates etc. only need to run once.
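A rough sketch of that single-frame approach follows. The column name source and the variable names are assumptions chosen for illustration, and the sample data is rebuilt from the question without the elided Col1 to Col3 columns.

```python
import pandas as pd

# Sample data rebuilt from the question; Col1..Col3 are omitted (elided in the original).
dfs = [
    pd.DataFrame({
        "Account": [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid": [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account": [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid": [True, False, True, False],
    }),
]

# Stack the frames and tag each row with the position of the frame it came from.
# "source" is a hypothetical column name.
combined = pd.concat(
    [d.assign(source=i) for i, d in enumerate(dfs)],
    ignore_index=True,
)

# Judge duplicates per source frame, so an Account that appears once in
# each frame is not flagged as a duplicate.
is_dup = combined.duplicated(["source", "Account"], keep=False)
duplicated_rows = combined[is_dup]
unique_rows = combined[~is_dup]

print(duplicated_rows["Account"].tolist())  # [1234, 1237, 1234, 1237, 41, 41]
print(unique_rows["Account"].tolist())      # [4211, 312, 314, 65]
```

Here unique_rows holds only the Accounts that never repeat within their own frame; filtering on the source column recovers the per-frame results when they are needed separately.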
