Fastest way to concatenate multiple files column wise - Python


Question

What is the fastest method to concatenate multiple files column wise (in Python)?

Assume that I have two files with 1,000,000,000 lines and ~200 UTF8 characters per line.

Method 1: paste

I could concatenate the two files under a linux system by using paste in shell, and I could cheat using os.system, i.e.:

import os

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
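
A shell-free variant (a sketch, assuming paste is on the PATH; it avoids assembling a shell command from raw file names) could be:

import subprocess

def concat_files_paste(file1, file2, output):
    # paste writes the column-wise concatenation to stdout; redirect it to the output file
    with open(output, 'wb') as fout:
        subprocess.check_call(['paste', file1, file2], stdout=fout)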

Method 2: Using nested context managers with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                # line1 keeps its trailing newline, so strip it before joining
                fout.write(line1.rstrip('\n') + '\t' + line2)

Method 3: Using fileinput

Does fileinput iterate through the files in parallel? Or will it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this:

import fileinput

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with open(output, 'w') as fout, fileinput.input(files=(file1, file2)) as f:
        for line in f:
            # `process` is a hypothetical helper that would split a combined line
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)
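
In fact, fileinput.input() chains its files sequentially (every line of the first file, then every line of the second) rather than zipping them, so the sketch above would not work as written. A quick check, with placeholder file names:

import fileinput

# all of file1.txt is yielded before the first line of file2.txt
for line in fileinput.input(files=('file1.txt', 'file2.txt')):
    print(fileinput.filename(), fileinput.filelineno(), line.rstrip('\n'))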

Method 4: Treat them as csv files

import csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout, delimiter='\t')
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # each row is a list of fields, so concatenate the two lists
            writer.writerow(line1 + line2)

Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method, how would I choose a different delimiter other than , or \t?
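
For reference, a sketch of how the delimiter can be chosen in each approach; the file names and the '|' delimiter are placeholders:

import csv
import subprocess

DELIM = '|'  # stand-in for any single-character delimiter

# Method 1 (paste): the -d option sets the output delimiter
with open('out_paste.txt', 'w') as fout:
    subprocess.check_call(['paste', '-d', DELIM, 'file1.txt', 'file2.txt'],
                          stdout=fout)

# Method 2 (zip loop): join the stripped lines with any string
with open('out_zip.txt', 'w') as fout, \
        open('file1.txt') as f1, open('file2.txt') as f2:
    for l1, l2 in zip(f1, f2):
        fout.write(l1.rstrip('\n') + DELIM + l2)

# Method 4 (csv): pass delimiter= to csv.writer (and to csv.reader if the
# input files themselves use a non-comma delimiter)
with open('out_csv.txt', 'w') as fout, \
        open('file1.txt') as f1, open('file2.txt') as f2:
    writer = csv.writer(fout, delimiter=DELIM)
    for row1, row2 in zip(csv.reader(f1), csv.reader(f2)):
        writer.writerow(row1 + row2)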

Are there other ways of achieving the same column-wise concatenation? Are they as fast?

Answer

Of all four methods I'd take the second, but you have to take care of small details in the implementation. (With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds; the file I was working with had 1M rows. There should not be too much difference if the file is 1K times bigger, since we use almost no memory.)

Changes from the original implementation:

  • Use iterators whenever possible; otherwise memory is consumed and the whole file has to be handled at once. (Mainly, if you are on Python 2, use itertools.izip instead of zip.)
  • When concatenating strings, use "{}{}".format() or similar; otherwise you generate a new string instance every time.
  • There is no need to write line by line inside the for loop; you can pass an iterator to the write operation.
  • Small buffers are very interesting, but with iterators the difference is minimal; it is much slower if you try to get all the data at once (for example, f1.readlines(1024 * 1000)).

Example:

from itertools import izip  # Python 2; on Python 3 the built-in zip is lazy

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        # caveats: readlines(1024) returns roughly one 1024-byte chunk of
        # lines (iterate f1 and f2 directly to stream whole files), and l1
        # keeps its trailing newline, which you would want to rstrip first
        fo.write("".join("{}\t{}".format(l1, l2) 
           for l1, l2 in izip(f1.readlines(1024), 
                              f2.readlines(1024))))

Profile of the original solution:

We see that the biggest problems are in write and zip (mainly because iterators are not used and the whole file has to be handled and processed in memory):

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
10000006 function calls in 5.208 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
    1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}
    3    0.000    0.000    0.000    0.000 {open}
    1    1.072    1.072    1.072    1.072 {zip}

Profile of this solution:

~/personal/python-algorithms/files$ python -m cProfile sol1.py 
     3731 function calls in 0.002 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
    1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
 1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
    1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
    1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
    3    0.000    0.000    0.000    0.000 {open}

And in Python 3 it is even faster, because iterators are built in and we don't need to import any library.

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]
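
The answer does not show sol2.py; a Python 3 version in the same spirit might look like this (a sketch, not the script that was measured):

def concat_iter_py3(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        # zip is lazy in Python 3; iterating the file objects directly
        # streams whole files, and rstrip drops l1's trailing newline
        fo.write("".join("{}\t{}".format(l1.rstrip('\n'), l2)
           for l1, l2 in zip(f1, f2)))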

It is also very nice to look at the memory consumption and the file-system accesses, which confirm what we said before:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0


$ /usr/bin/time -v python sol_original.py 
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696

