Fastest & I/O efficient way to combine heterogeneous CSV files in Python

Question

Given ten 1MB csv files, each with slightly different layouts, I need to combine them into a normalized single file with the same header. Empty string is fine for nulls.

Example columns:

1. FIELD1, FIELD2, FIELD3
2. FIELD2, FIELD1, FIELD3
3. FIELD1, FIELD3, FIELD4
4. FIELD3, FIELD4, FIELD5, FIELD6
5. FIELD2

The output would look like this (the order isn't important, although my code puts them in the order discovered):

FIELD1, FIELD2, FIELD3, FIELD4, FIELD5, FIELD6

So basically the fields can come in any order, fields may be missing, and there may be new fields not seen before; all must be included in the output file. No joining is required; in the end the count of data rows in the parts must equal the count of rows in the output.

Reading all 10MB into memory is OK. Somehow using 100MB to do it would not be. You can open all files at once if needed as well. Lots of file handles and memory available, but it will be running against a NAS, so it needs to be efficient for that (not too many NAS ops).

The method I have right now is to read each file into column lists, building new column lists as I discover new columns, then writing it all out to a single file. I'm hoping someone has something a bit more clever, though, as I'm bottlenecking on this process, so any relief is helpful.
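For illustration, a minimal sketch of that column-lists approach might look like the following (my own reconstruction in Python 3 syntax, not the asker's actual code; the asker's version is posted as an answer):

import csv
import glob

# Sketch of the column-lists approach: every column is an in-memory list,
# padded with blanks so all columns stay the same length.
columns = {}    # header -> list of values; dict keeps discovery order (3.7+)
row_count = 0   # data rows consumed so far, across all files

for filename in glob.glob('in*.csv'):
    with open(filename, newline='') as fileobj:
        reader = csv.reader(fileobj)
        headers = next(reader)
        # a newly discovered column gets blanks for all earlier rows
        for header in headers:
            columns.setdefault(header, [''] * row_count)
        for row in reader:
            for header, value in zip(headers, row):
                columns[header].append(value)
            row_count += 1
            # columns this file lacks get a blank for this row
            for values in columns.values():
                if len(values) < row_count:
                    values.append('')

with open('result.csv', 'w', newline='') as outf:
    writer = csv.writer(outf)
    writer.writerow(list(columns))            # headers in discovery order
    writer.writerows(zip(*columns.values()))  # transpose columns into rows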

I have sample files here if anyone wants to try. I'll post my current code as a possible answer. Looking for the fastest time when I run it on my server (lots of cores, lots of memory) using local disk.

Answer

Use a two-pass approach with csv.DictReader() and csv.DictWriter() objects. Pass one collects the set of headers used across all the files, and pass two then copies across data based on the headers.

Collecting the headers is as simple as accessing the fieldnames attribute on the reader objects:

import csv
import glob

files = []
readers = []
fields = set()

try:
    for filename in glob.glob('in*.csv'):
        try:
            fileobj = open(filename, 'rb')
        except IOError:
            print "Failed to open {}".format(filename)
            continue
        files.append(fileobj)  # for later closing

        reader = csv.DictReader(fileobj)
        fields.update(reader.fieldnames)  # reads the first row
        readers.append(reader)

    with open('result.csv', 'wb') as outf:
        writer = csv.DictWriter(outf, fieldnames=sorted(fields))
        writer.writeheader()
        for reader in readers:
            # copy across rows; missing fields will be left blank
            for row in reader:
                writer.writerow(row)
finally:
    # close out open file objects
    for fileobj in files:
        fileobj.close()

Each reader produces a dictionary with a subset of all fields, but DictWriter will use the value of the restval argument (which defaults to '' when omitted, as I did here) to fill in the value of each missing key.
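A quick standalone illustration of that fill-in behaviour (my own snippet, in Python 3 syntax):

import csv
import io

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['FIELD1', 'FIELD2', 'FIELD3'])
writer.writeheader()
writer.writerow({'FIELD2': 'b'})  # FIELD1 and FIELD3 missing from the dict
print(buf.getvalue())
# FIELD1,FIELD2,FIELD3
# ,b,    <- the missing keys were filled with restval's default, ''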

I assumed Python 2 here; if this is Python 3 you could use an ExitStack() to manage the open files for the readers; omit the b from the file modes and add a newline='' argument to all open calls to leave newline handling to the CSV module.
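Put together, the Python 3 variant could look like this (my adaptation of the code above, not part of the original answer):

import csv
import glob
from contextlib import ExitStack

with ExitStack() as stack:
    readers = []
    fields = set()

    # pass one: open every input file and collect the union of the headers
    for filename in glob.glob('in*.csv'):
        fileobj = stack.enter_context(open(filename, newline=''))
        reader = csv.DictReader(fileobj)
        fields.update(reader.fieldnames)  # reads the first row
        readers.append(reader)

    # pass two: copy rows across; DictWriter blanks out missing fields
    with open('result.csv', 'w', newline='') as outf:
        writer = csv.DictWriter(outf, fieldnames=sorted(fields))
        writer.writeheader()
        for reader in readers:
            for row in reader:
                writer.writerow(row)

ExitStack closes all the reader file objects when the with block exits, even on an error, which replaces the try/finally bookkeeping above.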

The above code only ever uses a buffer to read and write rows; rows are moved from one open reader to the writer one at a time.

Unfortunately, we cannot use writer.writerows(reader), as the DictWriter.writerows() implementation first converts everything in reader to a list of lists before passing it on to the underlying csv.writer.writerows() method; see issue 23495 in the Python bug tracker.
