GNU Parallel: split file into children


Problem description

Goal


Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain at most N lines, where N = 104,214,420. Children should be in .gz format.
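(For reference: 1,667,430,708 lines / 16 ≈ 104,214,419.25, so N appears to be that value rounded up to the next multiple of 4, i.e. one whole fastq record: 16 × 104,214,420 = 1,667,430,720 ≥ 1,667,430,708.)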

Input file

  • Name: file1.fastq.gz
  • Size: 39 GB
  • Lines: 1,667,430,708 (uncompressed)

Hardware

  • 36 GB RAM
  • 16 CPUs
  • HPCC environment (I am not the administrator)

Code

Version 1

zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"


Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.

Version 2

# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.

zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"


The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:

parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.

Questions

  1. How can I improve my code?
  2. Is there a faster way to accomplish this?

Recommended answer


Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.


You tell that to GNU Parallel with -L 4.


In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.


To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.

zcat file1.fastq.gz |
  parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"


This will pass blocks to 16 children, and a block defaults to 1 MB, chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that with --round-robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:

zcat file1.fastq.gz |
  parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
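
As a quick sanity check (my addition, not part of the original answer): every child produced this way should start on a record boundary, so the first line of each child should be a seqname beginning with @:

# Hypothetical check: print the first line of every child;
# each should start with '@' if records were kept intact.
for f in "${input_file}"_child_*.gz; do zcat "$f" | head -n 1; done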


parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
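
If you want to verify that rate, one option is to put pv into the pipe to report throughput (this assumes pv is installed on the node, which may not be the case in an HPCC environment):

# pv passes stdin through unchanged and prints the transfer rate on stderr
zcat file1.fastq.gz | pv |
  parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"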


Now if you had the fastq file uncompressed we could do it even faster, but we will have to cheat a little: often in fastq files the seqnames start with the same string:

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333


Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
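
To find that common prefix you only need the first seqname; a minimal sketch (my addition):

head -n 1 file1.fastq
# e.g. @EAS54_6_R1_2_1_413_324 -> @EAS54_6_R is a safe prefix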


We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to set the block size to 1/16 of the size of file1.fastq:

parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
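
One way to compute that value instead of typing it by hand, assuming GNU stat is available (its -c%s format prints the file size in bytes):

# Sketch: block size = file size / 16, rounded up
block=$(( ($(stat -c%s file1.fastq) + 15) / 16 ))
parallel -a file1.fastq --block "$block" -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"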


If you use GNU Parallel 20161222 or later, GNU Parallel can do that computation for you. --block -1 means: choose a block size so that one block can be given to each of the 16 jobslots.

parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
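
Checking which version is installed is trivial:

parallel --version | head -n 1
# prints e.g. "GNU parallel 20161222"; --block -1 needs 20161222 or newer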


Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.


It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:

parallel -a file1.fastq --pipe-part --block -1 -j16 \
  --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
  my_command


Here we assume that the lines will start like this:

@
[A-Za-z\n\.~]
anything
anything


Even if you have a few quality lines starting with '@', they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by a seqname line, which starts with @.


You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:

  • You would have to be able to keep the full uncompressed file in RAM.
  • The last gzip will only be started after the last byte has been read (and the first gzip will probably be done by then).


By setting the number of records to 104,214,420 (using -N), this is basically what you are doing, and your server is probably struggling to keep the 150 GB of uncompressed data in its 36 GB of RAM.

