Split STDIN to multiple files (and compress them if possible)

Question

I have a program (gawk) that outputs a stream of data to its STDOUT. The data being processed amounts to tens of GB. I don't want to persist it in a single file, but rather split it into chunks and potentially apply some extra processing (like compression) to each chunk before saving.

My data is a sequence of records, and I don't want the splitting to cut a record in half. Each record matches the following regexp:

^\{index.+?\}\}\n\{.+?\}$

Or, for simplicity, one can assume that every two lines make a record: an odd-numbered line followed by an even-numbered line, counting from the beginning of the stream.

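Can I:

  • use some standard Linux command to split STDIN by defining a preferred chunk size? The size does not need to be exact, given that record size varies. Alternatively, split by number of records only, if splitting by size is impossible.
  • compress each chunk and store it in a file (with some numbering in the name, like 001, 002, etc.)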

I've become aware of commands like GNU parallel and csplit, but I don't know how to put them together. It would be nice if the functionality described above could be achieved without writing a custom Perl script for it. A script could be a last-resort solution, but again, I'm not sure how best to implement it.

Recommended answer

GNU Parallel can split stdin into chunks of records. This will split stdin into chunks of about 50 MB, with each record being 2 lines. Each chunk will be passed to gzip and compressed to a file named [chunk number].gz:

cat big | parallel -l2 --pipe --block 50m gzip ">"{#}.gz

If you know your second line will never start with '{index', you can use '{index' as the record start:

cat big | parallel --recstart '{index' --pipe --block 50m gzip ">"{#}.gz
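If GNU Parallel is not installed, the count-based variant mentioned in the question can be sketched with GNU split instead. This is a different tool from the one in the answer, and it assumes a GNU coreutils version of split that supports --filter. The sample records, the chunk size of 4 lines, and the chunk_ prefix are all made up for illustration; in practice the loop would be replaced by the gawk command and the line count by a much larger even number.

```shell
#!/bin/sh
# Hypothetical sample input: 5 two-line records (10 lines) standing in
# for the real gawk output.
for i in 1 2 3 4 5; do
  printf '{index: {"_id": %s}}\n{"field": "value %s"}\n' "$i" "$i"
done |
# -l 4     : four lines (two records) per chunk; keep this number even
#            so that no two-line record is cut in half.
# -d -a 3  : numeric suffixes of width 3 (000, 001, ...).
# --filter : each chunk is piped to the given command instead of being
#            written directly; split sets $FILE to the chunk's name.
# -        : read from stdin; chunk_ is the output name prefix.
split -l 4 -d -a 3 --filter='gzip > "$FILE.gz"' - chunk_
```

With this input the sketch writes chunk_000.gz, chunk_001.gz and chunk_002.gz. Unlike --block in GNU Parallel, the chunks are sized by line count rather than by bytes, so their compressed and uncompressed sizes will vary with record length.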

You can then easily check that the splitting went correctly with:

parallel zcat {} \| wc -l ::: *.gz

Unless your records are all the same length, you will probably see a different number of lines in each file, but every count should be even.
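The same check can be automated rather than read by eye. The following is a minimal sketch in plain shell, assuming two-line records whose first line starts with '{index'. The two tiny chunk files it creates (1.gz and 2.gz) and their contents are hypothetical and only there to make the example runnable on its own.

```shell
#!/bin/sh
# Build two tiny hypothetical chunks so the sketch is self-contained.
printf '{index a}}\n{data a}\n{index b}}\n{data b}\n' | gzip > 1.gz
printf '{index c}}\n{data c}\n' | gzip > 2.gz

status=0
for f in 1.gz 2.gz; do
  # An odd line count means a two-line record was cut in half.
  lines=$(gzip -dc "$f" | wc -l)
  if [ $((lines % 2)) -ne 0 ]; then
    echo "$f: odd line count, a record was split" >&2
    status=1
  fi
  # Each chunk should begin at a record boundary.
  first=$(gzip -dc "$f" | head -n 1)
  case "$first" in
    '{index'*) ;;
    *) echo "$f: does not start with a record header" >&2; status=1 ;;
  esac
done
echo "status=$status"
```

For real output, the file list in the loop would be replaced with *.gz, and a non-zero status indicates that at least one chunk failed a check.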

Watch the intro videos for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
