Fastest way to concatenate multiple files column wise - Python


Question

What is the fastest method to concatenate multiple files column wise (in Python)?

Assume that I have two files with 1,000,000,000 lines and ~200 UTF8 characters per line.

Method 1: paste

I could concatenate the two files under a linux system by using paste in shell, and I could cheat using os.system, i.e.:

import os

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
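
A shell-free variant (a sketch, assuming paste is on the PATH; it avoids assembling a shell command from raw file names) could be:

import subprocess

def concat_files_paste(file1, file2, output):
    # paste writes the column-wise concatenation to stdout; redirect it to the output file
    with open(output, 'wb') as fout:
        subprocess.check_call(['paste', file1, file2], stdout=fout)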

Method 2: Using nested context managers with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                # line1 keeps its trailing newline, so strip it before joining
                fout.write(line1.rstrip('\n') + '\t' + line2)

Method 3: Using fileinput

Does fileinput iterate through the files in parallel? Or will it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this:

import fileinput

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with open(output, 'w') as fout, fileinput.input(files=(file1, file2)) as f:
        for line in f:
            # `process` is a hypothetical helper that would split a combined line
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)
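
In fact, fileinput.input() chains its files sequentially (every line of the first file, then every line of the second) rather than zipping them, so the sketch above would not work as written. A quick check, with placeholder file names:

import fileinput

# all of file1.txt is yielded before the first line of file2.txt
for line in fileinput.input(files=('file1.txt', 'file2.txt')):
    print(fileinput.filename(), fileinput.filelineno(), line.rstrip('\n'))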

Method 4: Treat them as csv files

import csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout, delimiter='\t')
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # each row is a list of fields, so concatenate the two lists
            writer.writerow(line1 + line2)

Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method, how would I choose a different delimiter other than , or \t?
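
For reference, a sketch of how the delimiter can be chosen in each approach; the file names and the '|' delimiter are placeholders:

import csv
import subprocess

DELIM = '|'  # stand-in for any single-character delimiter

# Method 1 (paste): the -d option sets the output delimiter
with open('out_paste.txt', 'w') as fout:
    subprocess.check_call(['paste', '-d', DELIM, 'file1.txt', 'file2.txt'],
                          stdout=fout)

# Method 2 (zip loop): join the stripped lines with any string
with open('out_zip.txt', 'w') as fout, \
        open('file1.txt') as f1, open('file2.txt') as f2:
    for l1, l2 in zip(f1, f2):
        fout.write(l1.rstrip('\n') + DELIM + l2)

# Method 4 (csv): pass delimiter= to csv.writer (and to csv.reader if the
# input files themselves use a non-comma delimiter)
with open('out_csv.txt', 'w') as fout, \
        open('file1.txt') as f1, open('file2.txt') as f2:
    writer = csv.writer(fout, delimiter=DELIM)
    for row1, row2 in zip(csv.reader(f1), csv.reader(f2)):
        writer.writerow(row1 + row2)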

Are there other ways of achieving the same column-wise concatenation? Are they as fast?

Answer

Of all four methods I'd take the second, but you have to take care of small details in the implementation. (With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds; the file I was working with had 1M rows. There should not be too much difference if the file is 1K times bigger, since we use almost no memory.)

Changes from the original implementation:

  • Use iterators whenever possible; otherwise memory is consumed and the whole file has to be handled at once. (Mainly, if you are on Python 2, use itertools.izip instead of zip.)
  • When concatenating strings, use "{}{}".format() or similar; otherwise you generate a new string instance every time.
  • There is no need to write line by line inside the for loop; you can pass an iterator to the write operation.
  • Small buffers are very interesting, but with iterators the difference is minimal; it is much slower if you try to get all the data at once (for example, f1.readlines(1024 * 1000)).

Example:

from itertools import izip  # Python 2; on Python 3 the built-in zip is lazy

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        # caveats: readlines(1024) returns roughly one 1024-byte chunk of
        # lines (iterate f1 and f2 directly to stream whole files), and l1
        # keeps its trailing newline, which you would want to rstrip first
        fo.write("".join("{}\t{}".format(l1, l2) 
           for l1, l2 in izip(f1.readlines(1024), 
                              f2.readlines(1024))))

Profile of the original solution:

We see that the biggest problems are in write and zip (mainly because iterators are not used and the whole file has to be handled and processed in memory):

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
10000006 function calls in 5.208 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
    1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}
    3    0.000    0.000    0.000    0.000 {open}
    1    1.072    1.072    1.072    1.072 {zip}

Profile of this solution:

~/personal/python-algorithms/files$ python -m cProfile sol1.py 
     3731 function calls in 0.002 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
    1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
 1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
    1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
    1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
    3    0.000    0.000    0.000    0.000 {open}

And in Python 3 it is even faster, because iterators are built in and we don't need to import any library.

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]
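
The answer does not show sol2.py; a Python 3 version in the same spirit might look like this (a sketch, not the script that was measured):

def concat_iter_py3(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        # zip is lazy in Python 3; iterating the file objects directly
        # streams whole files, and rstrip drops l1's trailing newline
        fo.write("".join("{}\t{}".format(l1.rstrip('\n'), l2)
           for l1, l2 in zip(f1, f2)))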

It is also very nice to look at the memory consumption and the file-system accesses, which confirm what we said before:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0


$ /usr/bin/time -v python sol_original.py 
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696

