Keep duplicate rows in multiple dataframes
Question
With the following dataframes, how do I extract and keep in separate dataframes:
- rows with a unique Account only
- all rows with duplicated Accounts
I have two datasets, df[0]:
Account Verified Paid Col1 Col2 Col3
1234 True True ... ... ...
1237 False True
1234 True True
4211 True True
1237 False True
312 False False
...and df[1]:
Account Verified Paid Col1 Col2 Col3
41 True True ... ... ...
314 False False
41 True True
65 False False
To pass through all the dataframes in my list without replacing df[i], and extract the unique rows, I used the following code:
filt = []
for i in range(0,1):
    filt.append(df[i].groupby(list(df[i].Account)).agg('first').reset_index())
However, I would also like to pass through all the dataframes in my list and, still without replacing my df, extract the rows with duplicates.
For example, in the example above, I should get one dataframe that includes accounts 1234 and 1237, and another dataframe that includes only 41.
How could I get these two datasets?
Recommended answer
Use duplicated('Account', keep=False).
You have two dataframes with some duplicates in the 'Account' column.
There's no need for the line-by-line groupby hack you wrote.
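To get a dataframe with unique Accounts only, i.e. with duplicates dropped, use drop_duplicates(). By default it returns a new dataframe and leaves df[i] unchanged. See its keep='first'/'last'/False option (False drops every row whose Account is repeated) and its inplace=True option.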
>>> df[0].drop_duplicates('Account')
Account Verified Paid Col1 Col2 Col3
0 1234 True True ... ... ...
1 1237 False True NaN NaN NaN
3 4211 True True NaN NaN NaN
5 312 False False NaN NaN NaN
>>> df[1].drop_duplicates('Account')
Account Verified Paid Col1 Col2 Col3
0 41 True True ... ... ...
1 314 False False NaN NaN NaN
3 65 False False NaN NaN NaN
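As a minimal sketch of how the three keep settings differ, the snippet below rebuilds df[0] from the question under the name df0 (an assumption; only the three fully shown columns are included) and prints which row labels survive each call. The original dataframe is not modified, because inplace is left at its default.

```python
import pandas as pd

# Sample data reconstructed from the question; Col1..Col3 are omitted
# because their values are elided in the original post.
df0 = pd.DataFrame({
    "Account": [1234, 1237, 1234, 4211, 1237, 312],
    "Verified": [True, False, True, True, False, False],
    "Paid": [True, True, True, True, True, False],
})

# keep='first' is the default: one row per Account, the first occurrence.
first_kept = df0.drop_duplicates("Account")

# keep='last': one row per Account, the last occurrence.
last_kept = df0.drop_duplicates("Account", keep="last")

# keep=False: only Accounts that appear exactly once remain.
never_repeated = df0.drop_duplicates("Account", keep=False)

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(never_repeated.index.tolist())
```

If "unique Account only" in the question means accounts that never repeat, keep=False is probably the variant to use; if it means one row per account, the default is enough.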
And to get a dataframe with the duplicated records only, use .duplicated('Account', keep=False) as a boolean mask. keep=False marks every occurrence of a repeated Account as True, which means 'keep all duplicates'.
>>> df[0][ df[0].duplicated('Account', keep=False) ]
Account Verified Paid Col1 Col2 Col3
0 1234 True True ... ... ...
1 1237 False True NaN NaN NaN
2 1234 True True NaN NaN NaN
4 1237 False True NaN NaN NaN
>>> df[1][ df[1].duplicated('Account', keep=False) ]
Account Verified Paid Col1 Col2 Col3
0 41 True True ... ... ...
2 41 True True NaN NaN NaN
You might want to sort the last two dataframes in order of 'Account':
df[0][ df[0].duplicated('Account', keep=False) ].sort_values('Account')
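To apply this to every dataframe in the list without replacing any of them, one option is a pair of list comprehensions. This is only a sketch: the names dfs, uniques and duplicates are illustrative, and the sample data is rebuilt from the question with the elided columns left out.

```python
import pandas as pd

# Stand-in for the question's list of dataframes (df[0] and df[1]).
dfs = [
    pd.DataFrame({
        "Account": [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid": [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account": [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid": [True, False, True, False],
    }),
]

# One new dataframe per input, first occurrence of each Account kept.
uniques = [d.drop_duplicates("Account") for d in dfs]

# One new dataframe per input, containing every row whose Account repeats,
# sorted so that repeated accounts sit next to each other.
duplicates = [
    d[d.duplicated("Account", keep=False)].sort_values("Account")
    for d in dfs
]

for dup in duplicates:
    print(dup["Account"].tolist())
```

Both comprehensions build new objects, so the dataframes in dfs keep all their original rows.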
Note: keeping a list df[i] of multiple dataframes and iterating over it is not very idiomatic pandas. It is generally better to merge or concat the dataframes and add one extra column that records which dataframe each row came from. (It is also more efficient, since groupby, apply, drop_duplicates etc. then only need to run once.)
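A rough sketch of that combined approach follows. The column name source and the variable names are assumptions made for illustration, and the sample data is rebuilt from the question with the elided columns left out.

```python
import pandas as pd

dfs = [
    pd.DataFrame({
        "Account": [1234, 1237, 1234, 4211, 1237, 312],
        "Verified": [True, False, True, True, False, False],
        "Paid": [True, True, True, True, True, False],
    }),
    pd.DataFrame({
        "Account": [41, 314, 41, 65],
        "Verified": [True, False, True, False],
        "Paid": [True, False, True, False],
    }),
]

# Tag each row with the position of its dataframe, then stack everything.
combined = pd.concat(
    [d.assign(source=i) for i, d in enumerate(dfs)],
    ignore_index=True,
)

# Include 'source' in the subset so that the same Account appearing in two
# different dataframes is not treated as a duplicate.
key = ["source", "Account"]
duplicates = combined[combined.duplicated(key, keep=False)]
uniques = combined.drop_duplicates(key)

print(duplicates.index.tolist())
print(uniques.index.tolist())
```

If the per-dataframe results are still needed afterwards, they can be recovered by filtering on the source column, for example duplicates[duplicates["source"] == 0].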