Merge CSVs in Python with different columns


Question

I have hundreds of large CSV files that I would like to merge into one. However, not all CSV files contain all columns. Therefore, I need to merge files based on column name, not column position.

Just to be clear: in the merged CSV, values should be empty for a cell coming from a line which did not have the column of that cell.

I cannot use the pandas module, because it makes me run out of memory.

Is there a module that can do that, or some easy code?

Recommended answer

The csv.DictReader and csv.DictWriter classes should work well (see Python docs). Something like this:

import csv
inputs = ["in1.csv", "in2.csv"]  # etc

# First determine the field names from the top line of each input file
# Comment 1 below
fieldnames = []
for filename in inputs:
  with open(filename, "r", newline="") as f_in:
    reader = csv.reader(f_in)
    headers = next(reader)
    for h in headers:
      if h not in fieldnames:
        fieldnames.append(h)

# Then copy the data
with open("out.csv", "w", newline="") as f_out:   # Comment 2 below
  writer = csv.DictWriter(f_out, fieldnames=fieldnames)
  writer.writeheader()  # Write the merged header row once, before any data
  for filename in inputs:
    with open(filename, "r", newline="") as f_in:
      reader = csv.DictReader(f_in)  # Uses the field names in this file
      for line in reader:
        # Comment 3 below
        writer.writerow(line)

Comments from above:

  1. You need to specify all the possible field names in advance to DictWriter, so you need to loop through all your CSV files twice: once to find all the headers, and once to read the data. There is no better solution, because all the headers need to be known before DictWriter can write the first line. This part would be more efficient using sets instead of lists (the in operator on a list is comparatively slow), but it won't make much difference for a few hundred headers. Sets would also lose the deterministic ordering of a list - your columns would come out in a different order each time you ran the code.
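  2. The above code is for Python 3, where files passed to the csv module should be opened with newline="" so that the module handles line endings itself. Without it, newlines embedded in quoted fields can be misread, and on platforms that write \r\n line endings an extra \r is added. Remove this argument for Python 2.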
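  3. At this point, line is a dict with the field names as keys, and the column data as values. You can specify what to do with blank or unknown values in the DictReader and DictWriter constructors: for DictWriter, restval (an empty string by default) is written for any column a row lacks, and extrasaction controls whether keys that are not in fieldnames raise an error or are ignored.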
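
As a minimal sketch of comments 1 and 3, the header collection can use a dict instead of a list or a set: membership checks stay fast, and the first-seen column order is kept (dict insertion order is guaranteed from Python 3.7). The helper name collect_fieldnames and the sample rows are hypothetical, chosen only for illustration; restval and extrasaction are spelled out here even though they match the DictWriter defaults.

```python
import csv
import io


def collect_fieldnames(header_rows):
    """Return the union of all header names, in first-seen order."""
    seen = {}  # dict keys are unique and keep insertion order (Python 3.7+)
    for headers in header_rows:
        for name in headers:
            seen.setdefault(name, None)
    return list(seen)


# Hypothetical headers from two input files with different columns
fieldnames = collect_fieldnames([["id", "name"], ["id", "email"]])

out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=fieldnames,
    restval="",            # value written for columns a row does not have
    extrasaction="raise",  # raise ValueError on keys missing from fieldnames
)
writer.writeheader()
writer.writerow({"id": "1", "name": "Ann"})               # no "email" key
writer.writerow({"id": "2", "email": "bob@example.com"})  # no "name" key
merged_text = out.getvalue()
```

Rows that lack a column simply get an empty cell in that position, which is the behaviour the question asks for.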

This method should not run out of memory, because it never has the whole file loaded at once.
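
To make the memory behaviour concrete, here is a hedged end-to-end sketch of the same two-pass idea wrapped in a function. The name merge_csv_files and the temporary sample files are assumptions made for this illustration, not part of the original answer. Only the set of column names and the row currently being copied are held in memory at any time.

```python
import csv
import os
import tempfile


def merge_csv_files(input_paths, output_path):
    """Merge CSV files by column name, streaming one row at a time."""
    # Pass 1: read only the header line of each file.
    fieldnames = {}
    for path in input_paths:
        with open(path, "r", newline="") as f_in:
            for name in next(csv.reader(f_in), []):  # [] for an empty file
                fieldnames.setdefault(name, None)

    # Pass 2: copy the data row by row; missing columns become empty cells.
    with open(output_path, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=list(fieldnames), restval="")
        writer.writeheader()
        for path in input_paths:
            with open(path, "r", newline="") as f_in:
                for row in csv.DictReader(f_in):
                    writer.writerow(row)
    return list(fieldnames)


# Usage example on two small throwaway files
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "in1.csv")
    second = os.path.join(tmp, "in2.csv")
    merged = os.path.join(tmp, "out.csv")
    with open(first, "w", newline="") as f:
        f.write("id,name\n1,Ann\n")
    with open(second, "w", newline="") as f:
        f.write("id,email\n2,bob@example.com\n")

    columns = merge_csv_files([first, second], merged)
    with open(merged, "r", newline="") as f:
        merged_rows = list(csv.reader(f))
```

One caveat under these assumptions: a data row with more cells than its file's header would make DictReader store the extras under the key None, which DictWriter rejects by default, so malformed inputs would need extra handling.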
