Split large text file (around 50 GB) into multiple files
Question
I would like to split a large text file of around 50 GB into multiple files. The data in the file looks like this (x = any integer between 0 and 9):
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
...............
...............
There might be a few billion lines in the file, and I would like to write, for example, 30-40 million lines per file. I guess the steps would be:
- Open the file
- Read it line by line using readline() and simultaneously write each line to a new file
- As soon as the maximum line count is reached, create another file and start writing again
I'm wondering how to put all these steps together in a memory-efficient and fast way. I've seen some examples on Stack Overflow, but none of them fully covers what I need. I would really appreciate it if anyone could help me out.
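The steps described above can be sketched in Python as follows. This is a minimal sketch, not the accepted answer's method; the function name and the part_ file-name prefix are illustrative. Reading line by line keeps memory usage flat no matter how large the input is.

```python
def split_file(path, lines_per_file, prefix="part_"):
    """Copy `path` into numbered output files of at most `lines_per_file` lines each."""
    part = 0       # index of the next output file
    count = 0      # lines written to the current output file
    out = None
    with open(path, "r") as src:
        for line in src:
            # Open a new output file at the start and whenever the limit is hit
            if out is None or count >= lines_per_file:
                if out:
                    out.close()
                out = open(f"{prefix}{part:02d}", "w")
                part += 1
                count = 0
            out.write(line)
            count += 1
    if out:
        out.close()
    return part  # number of files written
```

For the sizes in the question you would call something like split_file("data.txt", 30_000_000); the dominant cost is I/O, since only one line is held in memory at a time.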
Accepted answer
This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.
First, I created a test file with 1000M entries (15 GB):
awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt
Then I used split:
split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t
It took 5 minutes to produce a set of 34 small files named t00 through t33. The first 33 files are 458 MB each, and the last one, t33, is 153 MB.
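For readers who still prefer to stay in Python, the same chunking can be done with itertools.islice, which pulls a bounded number of lines from the open file handle without loading a whole chunk into memory. This is a sketch, not part of the original answer; the t prefix mirrors the split invocation above.

```python
from itertools import islice

def split_by_lines(path, lines_per_file, prefix="t"):
    """Stream `path` into numbered files of at most `lines_per_file` lines,
    roughly equivalent to `split --lines=N`."""
    part = 0
    with open(path) as src:
        while True:
            first = src.readline()
            if not first:  # end of input
                break
            with open(f"{prefix}{part:02d}", "w") as out:
                out.write(first)
                # islice yields at most lines_per_file - 1 further lines,
                # one at a time, directly from the source file
                for line in islice(src, lines_per_file - 1):
                    out.write(line)
            part += 1
    return part
```

In practice the coreutils split command will usually be faster, since it copies blocks rather than interpreting each line in Python.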