How to write multicore sorting using GNU Parallel

Views: 177 | Posted: 2016/8/3 12:04:36 | Tags: multithreading bash sorting parallel-processing gnu-parallel


Question

GNU parallel is a shell tool for executing jobs in parallel using one or more computers.

For example, if I want to write a multicore version of wc, I could do:

cat XXX | parallel --block 10M --pipe wc -l | awk 'BEGIN{count=0;}{count = count+ $1;} END{print count;}'

My question is how to do sorting using parallel. I know that I should pipe the result of parallel to a "merge sorted files" command (just like the final merge in merge sort), but I don't know how to do that.

Recommended answer

There are a few ways to do this.

Let's get a simple text file to play with:

$ curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt 2>/dev/null |
   tr " " "\n" | tr "[A-Z]" "[a-z]" | 
   sed -e 's/[[:punct:]]*//g' -e '/^[[:space:]]*$/d' > moby-dick-words.txt

$ wc -l moby-dick-words.txt
215117 moby-dick-words.txt
$ time sort moby-dick-words.txt > moby-dick-words-sorted.txt

real    0m0.260s
user    0m0.462s
sys     0m0.004s

We can sort the text in chunks, say 10000 words at a time, and defer some of the hard, serial work to the merging (sort -m) step:

$ mkdir tmp
$ time (
  cd tmp;
  split -l 10000 ../moby-dick-words.txt;
  parallel sort {} -o {}.sorted ::: x*;
  sort -m *.sorted > ../moby-dick-words-sorted-merge.txt;
  rm x* )

real    0m0.787s
user    0m0.495s
sys     0m0.103s

$ diff moby-dick-words-sorted.txt moby-dick-words-sorted-merge.txt 

$ uniq -c moby-dick-words-sorted-merge.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

So this splits the text into sequential 10000-line chunks, uses parallel to sort each chunk, and then uses sort -m to merge the sorted chunks into one fully sorted file.
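The same chunk-then-merge idea can be tried on a handful of words without downloading anything. This is only a minimal sketch: chunk_sort_merge is a hypothetical helper name, the 3-line default chunk size is an arbitrary choice for the demo, and the fallback to background jobs when GNU parallel is not installed is an addition that is not part of the answer.

```shell
#!/usr/bin/env bash
# Sketch: sort fixed-size chunks independently, then merge them with sort -m.
# chunk_sort_merge is a hypothetical helper, not something GNU parallel provides.
chunk_sort_merge() (
  # Reads lines on stdin, writes the sorted lines to stdout.
  # $1 = lines per chunk (assumed default: 3, tiny on purpose).
  export LC_ALL=C                      # same byte-order collation for every sort call
  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT            # clean up the chunk files when the subshell exits
  cd "$tmp" || exit 1
  split -l "${1:-3}" - chunk.          # stdin -> chunk.aa, chunk.ab, ...
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel sort -o {} {} ::: chunk.*             # one sort job per chunk
  else
    for f in chunk.*; do sort -o "$f" "$f" & done  # assumption: plain background jobs
    wait
  fi
  sort -m chunk.*                      # merge only; every input is already sorted
)

printf '%s\n' whale ahab sea ship boat white sail | chunk_sort_merge 3
# prints: ahab boat sail sea ship whale white (one word per line)
```

sort -m never re-sorts its inputs, so the final step is a single sequential pass no matter how many chunks were produced.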

The next approach is to do the hard work at the split stage rather than the merge stage, so that the partial results can be combined with a simple cat:

  $ rm tmp/*
  $ letters="a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9"
  $ time (
    cd tmp; 
    parallel sed -e "/^{}/w{}.txt" ../moby-dick-words.txt ::: $letters >& /dev/null;
    parallel sort {}.txt -o {}.sorted.txt ::: $letters;  
    cat *.sorted.txt > ../moby-dick-words-sorted-split.txt;
    rm *.txt )

  real  0m1.015s
  user  0m2.355s
  sys   0m0.510s
  $ diff moby-dick-words-sorted-split.txt moby-dick-words-sorted.txt
  $ uniq -c moby-dick-words-sorted-split.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

Here we split the file (in parallel) by the first character of each line, sort those files individually, and then the merge is a simple concatenation.
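A small self-contained sketch of the same bucket idea follows. It swaps the 36 parallel sed passes for a single awk pass that writes every line to a file named after its first character, which is a different splitting technique from the one in the answer. bucket_sort is a hypothetical helper name, and the sketch assumes lines start with characters that are safe in file names (letters and digits, as in the word list).

```shell
#!/usr/bin/env bash
# Sketch: partition by first character, sort each bucket, concatenate.
# bucket_sort is a hypothetical helper; input lines are assumed to start with
# letters or digits so that the first character is a valid file-name suffix.
bucket_sort() (
  export LC_ALL=C                      # glob order and sort order both follow byte order
  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT
  cd "$tmp" || exit 1
  # Split stage: one pass over stdin, each line goes to bucket.<first character>.
  awk '{ print > ("bucket." substr($0, 1, 1)) }'
  set -- bucket.*                      # bucket files, listed in byte order
  [ -e "$1" ] || exit 0                # empty input: nothing to sort
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel sort -o {} {} ::: "$@"    # one sort job per bucket
  else
    for f in "$@"; do sort -o "$f" "$f" & done   # assumption: plain background jobs
    wait
  fi
  cat "$@"                             # buckets are disjoint and ordered, so cat is the merge
)

printf '%s\n' zone ahab 42 zigzag boat | bucket_sort
# prints: 42 ahab boat zigzag zone (one word per line)
```

Concatenation is enough here because every line in an earlier bucket starts with a smaller byte than every line in a later one, which only holds when the buckets are read back in the same collation order that sort uses (LC_ALL=C in this sketch).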

Note that this is really for entertainment/educational purposes only; later versions of GNU sort have parallelism built in (look at the --parallel option), which will do a much better job than this. And a slicker version of the merge approach can be seen in this answer.
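For comparison, the built-in route might look like the sketch below. It assumes a GNU coreutils sort that accepts --parallel; sort_builtin_parallel is a hypothetical wrapper name, and falling back to 2 threads when nproc is unavailable is an arbitrary choice for the example.

```shell
#!/usr/bin/env bash
# Sketch: let GNU sort run its own threads instead of splitting by hand.
# sort_builtin_parallel is a hypothetical wrapper; extra arguments are passed to sort.
sort_builtin_parallel() {
  # nproc (GNU coreutils) reports the available cores; assume 2 if it is missing.
  LC_ALL=C sort --parallel="$(nproc 2>/dev/null || echo 2)" "$@"
}

printf '%s\n' whale ahab sea | sort_builtin_parallel
# prints: ahab sea whale (one word per line)
```

On the word list from this answer the call would be something like sort_builtin_parallel moby-dick-words.txt -o moby-dick-words-sorted.txt.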
