Splitting large text file into smaller text files by line numbers using Python
Question
I have a text file, say really_big_file.txt, that contains:
line 1
line 2
line 3
line 4
...
line 99999
line 100000
I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600 lines 301-600, and so on until there are enough small files made to contain all the lines from the big file.
I would appreciate any suggestions on the easiest way to accomplish this using Python.
Answer
Using the itertools grouper recipe:
from itertools import zip_longest  # izip_longest on Python 2

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300
with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}.txt'.format(i * n), 'w') as fout:
            fout.writelines(g)
The advantage of this method, as opposed to storing each line in a list, is that it works with iterables line by line, so it doesn't have to hold an entire small_file in memory at once.
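As a quick illustration (a small hypothetical example, not part of the original answer), the grouper recipe pads the final group with the fill value when the input length doesn't divide evenly:

```python
from itertools import zip_longest  # izip_longest on Python 2

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

# Seven items in groups of three: the last group is padded with 'x'.
chunks = list(grouper(3, 'ABCDEFG', 'x'))
print(chunks)  # [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```

With fillvalue='' and file lines as input, that padding is exactly why the last small file is named past the real final line number, as discussed below.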
Note that the last file in this case will be small_file_100200 but will only go until line 100000. This happens because fillvalue='': nothing is written to the file when there are no more lines left, because the group size doesn't divide evenly. You can fix this by writing to a temp file and then renaming it afterwards, instead of naming it first as above. Here's how that can be done:
import os, tempfile

n = 300
with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count the filler
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))
This time fillvalue=None, and each line is checked for None; when it occurs, the group is exhausted, so 1 is subtracted from j to avoid counting the filler, and only then is the file given its final name.
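An alternative that sidesteps the fill-value bookkeeping entirely is to read fixed-size chunks with itertools.islice, which simply returns fewer lines on the last iteration. This is a sketch of that approach (not from the original answer; the function name split_file is made up here):

```python
from itertools import islice

def split_file(path, n):
    """Split `path` into files of at most n lines each; the last
    file is simply shorter, so no fill value is ever written."""
    with open(path) as f:
        i = 0
        while True:
            lines = list(islice(f, n))  # next n lines; fewer (or none) at EOF
            if not lines:
                break
            i += len(lines)  # cumulative line count names each output file
            with open('small_file_{0}.txt'.format(i), 'w') as fout:
                fout.writelines(lines)
```

Called as split_file('really_big_file.txt', 300), this would produce small_file_300.txt, small_file_600.txt, and so on, with the last file named after the true final line number (small_file_100000.txt for a 100000-line input).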