使用python(pandas)有条件合并CSV文件 [英] Conditional merge for CSV files using python (pandas)

查看：239 发布时间：2020/7/12 0:04:00 python csv pandas data-processing

本文介绍了使用python(pandas)有条件合并CSV文件的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试合并具有相同架构的>=2文件.
这些文件将包含重复的条目，但行将不相同，例如:

I am trying to merge >=2 files with the same schema.
The files will contain duplicate entries but rows won't be identical, for example:

file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111

file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282

Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282

如果您发现: 9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999基于store_id和phone是相似的，但由于地址更具描述性，因此我从file2中选择了它.

If you noticed : 9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999 are similar based on store_id and phone but I picked it up from file2 since the address was more descriptive.

store_id+phone_number是我的复合主键，用于查找位置并查找重复项(在上面的示例中，store_id足以找到它，但我需要基于多个列值的键)

store_id+phone_number was my composite primary key to lookup a location and find duplicates (store_id is enough to find it in the above example but I need a key based on multiple column values)

问题:
-我需要合并具有相同架构但具有重复行的多个CSV文件.
-行级合并应该具有根据行的值选择行的特定值的逻辑.就像从文件1提取的电话和从文件2提取的地址一样.
-1或多个列值的组合将定义行是否重复.

Question:
- I need to merge multiple CSV files with same schema but with duplicate rows.
- Where the row level merge should have the logic to pick a specific value of a row based on its value. Like phone picked up from file1 and address pickedup from file2.
- A combination of 1 or many column values will define if rows are duplicate or not.

这可以通过使用熊猫来实现吗?

Can this be achieved using pandas?

推荐答案

将它们粉碎在一起的一种方法是使用merge(在store_id和number上，如果这些是索引，那么这将是联接而不是合并):

One way to smash them together is to use merge (on store_id and number, if these are the index then this would be a join rather than a merge):

In [11]: res = df1.merge(df2, on=['store_id', 'phone'], how='outer')

In [12]: res
Out[12]:
   store_id     address_x        phone           address_y
0      9191  9827 Park st    999999999  9827 Park st Apt82
1      8181  543 Hello st   1111111111                 NaN
2      7171           NaN  87282728282         912 John st

然后您可以使用 where 选择address_y(如果存在)，否则选择address_x:

You can then use where to select address_y if it exists, otherwise address_x:

In [13]: res['address'] = res.address_y.where(res.address_y, res.address_x)

In [14]: del res['address_x'], res['address_y']

In [15]: res
Out[15]: 
   store_id        phone             address
0      9191    999999999  9827 Park st Apt82
1      8181   1111111111        543 Hello st
2      7171  87282728282         912 John st

这篇关于使用python(pandas)有条件合并CSV文件的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

使用python(pandas)有条件合并CSV文件 [英] Conditional merge for CSV files using python (pandas)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

使用python(pandas)有条件合并CSV文件 [英] Conditional merge for CSV files using python (pandas)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭