How do I merge two CSV files based on field and keep same number of attributes on each record?


Question

I am attempting to merge two CSV files based on a specific field in each file.

file1.csv

id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"

file2.csv

id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False

This is the code I am using:

import csv
from collections import OrderedDict

with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader,None) # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader,None) # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)

I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:

1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure

file2 will not have a record for every id in file1. I'd like the output to have empty fields from file2 in the merged file. For example, id 1 would look like this:

1,True,7,Purple,,,

How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?

Solution

If we're not using pandas, I'd refactor to something like

import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()  # id -> merged row, in first-seen order
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp:  # Python 2: open csv files in binary mode
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            # fold every row that shares an id into a single dict
            data.setdefault(row["id"], {}).update(row)

# drop duplicate column names (e.g. the repeated "id") while keeping order
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        # columns a record never received are written as empty strings
        writer.writerow([row.get(field, '') for field in fieldnames])

which gives

id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
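
The code above targets Python 2 (binary file modes, `itervalues`). As a minimal sketch of the same idea for Python 3, assuming Python 3.7+ so that plain dicts keep insertion order: the function name `merge_csv_files` and its `out_path` and `key` parameters are hypothetical, chosen here for illustration. `csv.DictWriter` with `restval=""` does the padding that `row.get(field, '')` does above.

```python
import csv


def merge_csv_files(filenames, out_path, key="id"):
    """Merge CSV files on `key`, padding missing columns with empty strings."""
    data = {}        # key value -> merged row dict, in first-seen order
    fieldnames = []  # union of all header names, in first-seen order
    for filename in filenames:
        # newline="" lets the csv module handle line endings itself
        with open(filename, newline="") as fp:
            reader = csv.DictReader(fp)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                # fold every row that shares a key into a single dict
                data.setdefault(row[key], {}).update(row)

    with open(out_path, "w", newline="") as fp:
        # restval is written for any column a row does not have
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(data.values())
```

Calling `merge_csv_files(["file1.csv", "file2.csv"], "merged.csv")` should write the same seven-column layout as shown above.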

For comparison, the pandas equivalent would be something like

import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)

which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
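
One caveat with the pandas route: with default type inference, `read_csv` parses a column such as `attr2` as floats, so a value like `7` may be written back as `7.0`. If the merged file should carry the original text unchanged, a sketch along these lines may help. The function name `merge_csv_as_text` and its parameters are hypothetical; `dtype=str` and `keep_default_na=False` are existing `read_csv` options.

```python
import pandas as pd


def merge_csv_as_text(path1, path2, out_path, key="id"):
    """Outer-merge two CSV files on `key`, treating every cell as text."""
    # dtype=str keeps each cell as the text found in the file (no 7 -> 7.0);
    # keep_default_na=False stops empty cells or "NA" from becoming NaN on read
    df1 = pd.read_csv(path1, dtype=str, keep_default_na=False)
    df2 = pd.read_csv(path2, dtype=str, keep_default_na=False)
    # the outer merge still inserts NaN where a key is absent from one file,
    # so those gaps are replaced with empty strings before writing
    merged = df1.merge(df2, on=key, how="outer").fillna("")
    merged.to_csv(out_path, index=False)
    return merged
```

Note that an outer merge sorts the join keys, and with text keys that ordering is lexicographic, so an id of `10` would come before `2`. If the row order of file1.csv matters and file2.csv never introduces new ids, `how="left"` is worth considering instead.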
