How to compare two pandas dataframes and remove duplicates on one file without appending data from other file


Question

I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.

I am creating two pandas dataframes from these files:

import pandas as pd

dfmaster = pd.read_csv('test_master.csv')
dfdaily = pd.read_csv('test_daily.csv')

I want the daily list to get compared to the master list to see if there are any duplicate rows on the daily list that are already in the master list. If so, I want them to remove the duplicates from dfdaily. I then want to write this non-duplicate data to dfmaster.

The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison.

I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
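One way to express this "drop daily rows that already exist in the master" step is a left anti-join via `DataFrame.merge` with `indicator=True`. This is just a sketch (it is not the accepted answer below); the column names here are made up for illustration, and in practice the frames would come from `read_csv` as shown above:

```python
import pandas as pd

# Stand-in data; real frames would be read from test_master.csv / test_daily.csv.
dfmaster = pd.DataFrame({'column 1': [1, 10], 'column 2': [2, 11], 'column 3': [3, 12]})
dfdaily = pd.DataFrame({'column 1': [1, 4], 'column 2': [2, 5], 'column 3': [3, 6]})

# Left anti-join: merge on all shared columns, keep only the daily rows
# that found no exact match in the master ('left_only').
merged = dfdaily.merge(dfmaster, how='left', indicator=True)
dfdaily_new = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(dfdaily_new)
```

`dfdaily_new` can then be appended to the master and also written back out with `to_csv`, which covers both halves of the requirement.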

Here is an example of what the dataframes could look like.

test_master.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
+-------------+-------------+-------------+

test_daily.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10          | 11          | 12          |
| 13          | 14          | 15          |
+-------------+-------------+-------------+

The desired output is:

test_master.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10          | 11          | 12          |
| 13          | 14          | 15          |
+-------------+-------------+-------------+

test_daily.csv

+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
|       10 |       11 |       12 |
|       13 |       14 |       15 |
+----------+----------+----------+

Any help would be greatly appreciated!

EDIT

I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work. I believe it had something to do with index labels, as mentioned in a comment by Troy D below. Troy D's solution is the one I am now using.

Answer

Try this:

I create 2 dataframes, and then set rows 2-4 of the daily frame to be duplicates of the master:

import numpy as np
import pandas as pd

test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values

print(test_master)
print(test_daily)

Output:

      A         B         C
0  0.009322  0.330057  0.082956
1  0.197500  0.010593  0.356774
2  0.147410  0.697779  0.421207
      A         B         C
0  0.643062  0.335643  0.215443
1  0.009322  0.330057  0.082956
2  0.197500  0.010593  0.356774
3  0.147410  0.697779  0.421207
4  0.973867  0.873358  0.502973

Then, add a multiindex level to identify which data is from which dataframe:

test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)

Now merge as you suggested and drop duplicates:

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
merged = pd.concat([test_master, test_daily])
merged = merged.drop_duplicates().sort_index()
print(merged)

Output:

             A         B         C
  master                              
0 daily   0.643062  0.335643  0.215443
  master  0.009322  0.330057  0.082956
1 master  0.197500  0.010593  0.356774
2 master  0.147410  0.697779  0.421207
4 daily   0.973867  0.873358  0.502973

There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:

idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])

Output:

             A         B         C
  master                              
0 daily   0.643062  0.335643  0.215443
4 daily   0.973867  0.873358  0.502973
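To finish the round trip the question asks for, the daily-only slice can be written back to CSV after dropping the helper index level. The sketch below rebuilds a small frame with the same two-level index shape as above (deterministic values instead of `np.random`, so the duplicate rows are easy to see):

```python
import numpy as np
import pandas as pd

# Rebuild small master/daily frames: the first two daily rows duplicate the master.
test_master = pd.DataFrame(np.arange(9.0).reshape(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.concat(
    [test_master.iloc[:2], pd.DataFrame([[10.0, 11.0, 12.0]], columns=['A', 'B', 'C'])],
    ignore_index=True)

# Tag each frame's rows with an origin label in a second index level.
test_master['origin'] = 'master'
test_daily['origin'] = 'daily'
merged = pd.concat([test_master.set_index('origin', append=True),
                    test_daily.set_index('origin', append=True)])
merged = merged.drop_duplicates().sort_index()

# Slice out the surviving daily rows, drop the origin level, and write them back.
idx = pd.IndexSlice
daily_only = merged.loc[idx[:, 'daily'], :].droplevel(1)
daily_only.to_csv('test_daily.csv', index=False)
print(daily_only)
```

With `index=False`, the rewritten test_daily.csv contains only the column headers and the non-duplicate rows, which matches the desired output in the question.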

