Splitting large text file into smaller text files by line numbers using Python


Question

I have a text file say really_big_file.txt that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt to have lines 1-300, small_file_600 to have lines 301-600, and so on until there are enough small files made to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.

Answer

Using the itertools grouper recipe:

from itertools import zip_longest  # was izip_longest in Python 2

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}.txt'.format(i * n), 'w') as fout:
            fout.writelines(g)

The advantage of this method, as opposed to storing each line in a list, is that it works with iterables line by line, so it doesn't have to store each small_file in memory at once.
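To see the grouping behaviour on a small input, here is a quick sketch; the input string and the fill character 'x' are arbitrary, and the point is that grouper yields one tuple of n items at a time rather than materialising everything up front:

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    # Collect data into fixed-length chunks or blocks.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# grouper is lazy: each next() pulls exactly one chunk of n items,
# so only one chunk is ever held in memory.
chunks = grouper(3, 'ABCDEFG', fillvalue='x')
print(next(chunks))  # ('A', 'B', 'C')
print(list(chunks))  # [('D', 'E', 'F'), ('G', 'x', 'x')]
```

Note that the last chunk is padded with the fill value, which is exactly the behaviour the rest of the answer has to work around.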

Note that the last file in this case will be small_file_100200 but will only go until line 100000. This happens because fillvalue='', meaning I write out nothing to the file when I don't have any more lines left to write, because the group size doesn't divide equally. You can fix this by writing to a temp file and then renaming it afterwards, instead of naming it first as I have. Here's how that can be done.

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # Create the temp file in the output directory so os.rename
        # never has to move it across filesystems.
        with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count the filler
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None, and I go through each line checking for None; when it occurs, I know the process has finished, so I subtract 1 from j to not count the filler, and then write the file.
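As a quick illustration of why the None padding matters, here is a sketch with four made-up lines and a chunk size of 3: the final group comes back padded with None, and counting only the non-None items recovers the true number of lines in that last file.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    # Collect data into fixed-length chunks or blocks.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# The last group of an uneven split is padded with None,
# so counting the non-None items gives the true line count.
groups = list(grouper(3, ['line 1\n', 'line 2\n', 'line 3\n', 'line 4\n']))
last_group = groups[-1]
real_lines = [line for line in last_group if line is not None]
print(last_group)       # ('line 4\n', None, None)
print(len(real_lines))  # 1
```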
