How can I split a large text file into smaller files with an equal number of lines?

Question

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).

Recommended answer

Have a look at the split command:

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

which will create files of 200000 lines each, named xaa, xab, xac, ... (the last file holds whatever lines remain).
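If the goal is a fixed number of pieces rather than a fixed line count, the lines-per-file value can be derived from the total with a ceiling division. The following is a minimal sketch under stated assumptions: `sample.txt`, the `part_` prefix, and the piece count of 3 are made up for illustration, and `seq` and `mktemp` are assumed to be available.

```shell
#!/bin/sh
# Illustrative sketch: split a small sample file into about 3 pieces by line count.
workdir=$(mktemp -d)              # scratch directory so nothing else is touched
cd "$workdir" || exit 1

seq 1 25 > sample.txt             # hypothetical input: 25 lines

pieces=3                          # desired number of output files (assumption)
total=$(wc -l < sample.txt)       # total number of lines in the input

# Ceiling division, so the remainder never spills into an extra file.
per_file=$(( (total + pieces - 1) / pieces ))

# -l sets lines per output file; "part_" replaces the default prefix "x".
split -l "$per_file" sample.txt part_

# Show the line count of each piece (part_aa, part_ab, part_ac).
wc -l part_*
```

With 25 lines and 3 pieces this gives 9 lines per file, so the last piece ends up with the remaining 7 lines.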

Another option, split by size of output file (still splits on line breaks):

 split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix00 output_prefix01 output_prefix02 ... (numeric suffixes start at 00), each of maximum size 20 megabytes.
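The size-based variant can be tried on a small file to see that lines are never cut in half. This sketch assumes GNU split (for `-C` and `--numeric-suffixes`); the 1000-byte limit and the generated input are chosen only for illustration.

```shell
#!/bin/sh
# Illustrative sketch: split by size while keeping whole lines together.
workdir=$(mktemp -d)              # scratch directory
cd "$workdir" || exit 1

seq 1 1000 > input_filename       # hypothetical input, a few kilobytes

# -C 1000: at most 1000 bytes of complete lines per output file.
# --numeric-suffixes: name pieces output_prefix00, output_prefix01, ...
split -C 1000 --numeric-suffixes input_filename output_prefix

# List the size in bytes of each piece; none should exceed the limit.
wc -c output_prefix*

# Concatenating the pieces in order should reproduce the original file.
cat output_prefix* | cmp - input_filename && echo "pieces match the original"
```

Because the glob sorts the two-digit suffixes in order, `cat output_prefix*` reassembles the pieces in the sequence they were written.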
