Sort lines of massive file by number of words on line (ideally in parallel)


Question

I am working on a community detection algorithm for analyzing social network data from Facebook. The first task, detecting all cliques in the graph, can be done efficiently in parallel, and leaves me with an output like this:

17118 17136 17392
17064 17093 17376
17118 17136 17356 17318 12345
17118 17136 17356 17283
17007 17059 17116

Each of these lines represents a unique clique (a collection of node ids), and I want to sort these lines in descending order by the number of ids per line. In the case of the example above, here's what the output should look like:

17118 17136 17356 17318 12345
17118 17136 17356 17283
17118 17136 17392
17064 17093 17376
17007 17059 17116

(Ties---i.e., lines with the same number of ids---can be sorted arbitrarily.)

What is the most efficient way of sorting these lines?

Keep the following in mind:

  1. The file I want to sort could be larger than the physical memory of the machine
  2. Most of the machines that I'm running this on have several processors, so a parallel solution would be ideal
  3. An ideal solution would just be a shell script (probably using sort), but I'm open to simple solutions in python or perl (or any language, as long as it makes the task simple); see the baseline sketch after this list
  4. This task is in some sense very easy---I'm not just looking for any old solution, but rather for a simple and above all efficient solution
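
For context, the simplest such shell-only baseline is a single decorate-sort-undecorate pipeline. The following is a minimal sketch, assuming GNU coreutils (sort gained a --parallel flag in version 8.6); the thread count and buffer size are placeholders to tune for your machine:

# Prefix each line with its word count, sort numerically in descending
# order on that count, then strip the count off again.
awk '{ print NF, $0 }' infile |
  sort --parallel=8 -S 4G -k1,1nr |
  cut -d' ' -f2- > outfile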

Update 2: Best solution

Based on benchmarking the solutions proposed (see below), here is the best solution (taken from Vlad, who in turn adapted it from other solutions proposed here). It's quite clever and does not even use sort:

for FILE in infile.* ; do
  awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
    FILE=`basename $FILE` $FILE&
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*

Update 1: Benchmarking the proposed solutions

For benchmarking, I took the cliques found in an Oklahoma State Facebook network. The unsorted files that contain these cliques look just like the first example shown above and contain 46,362,546 lines in total, which brings the file size up to 6.4 GB. The cliques are almost evenly spread over 8 files. The system I am testing this on has 4 physical processors, each with 6 cores and a 12 MB L2 cache, for a total of 24 cores. It also has 128 GB of physical memory. Because the lines to be sorted were split into 8 files, most of these solutions used 8 (or 16) concurrent processes.

Ignoring the first naive approach, I benchmarked the remaining 5 suggestions from Vlad Romascanu (whose solution I selected).

The first solution was not too efficient:

real    6m35.973s
user    26m49.810s
sys     2m14.080s

I tried using solutions 2, 3, and 4, which use FIFO files, but each of them used only one sort process and thus took a very long time (so I killed them before they could finish).

The last solution was the quickest:

real    1m3.272s
user    1m21.540s
sys     1m22.550s

Note that the user time of this solution is 1m21s, much better than the first solution's 26 minutes.

Answer

A naive approach could be simply:

awk '{ print NF " " $0 }' infile | sort -k1,1nr |
  awk '{ $1=""; print $0 }' > outfile

This will keep up to 3 CPUs busy. sort is not limited by the amount of physical memory available: use the -S and -T switches to configure how much memory it may use (-S) before it resorts to temporary files in a temp directory (-T) on a big enough (and ideally fast) partition.
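
For instance, a sketch with placeholder values (the 16G limit and the /scratch directory are assumptions to adapt to your machine):

# Naive pipeline again, but letting sort use up to 16G of memory (-S)
# and spill temporary runs to a fast scratch partition (-T).
awk '{ print NF " " $0 }' infile |
  sort -S 16G -T /scratch -k1,1nr |
  awk '{ $1=""; print $0 }' > outfile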

If you can produce several input files by subdividing the work leading up to the sort phase, you would then be able to do:

for FILE in infile.* ; do
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp&
done
wait
sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.tmp

This will use up to N*2 CPUs; moreover, the last sort (merge-sort) is highly efficient.

Refining further, to improve parallelism to N*2+1, use FIFOs instead of intermediate files (again assuming multiple input files are possible):

for FILE in infile.* ; do
  mkfifo $FILE.fifo
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo

If multiple input files are not possible, you can simulate them (adding I/O overhead which will hopefully be amortized by the number of processes available):

PARALLELISM=5 # I want 5 parallel instances
for N in `seq 0 $((PARALLELISM - 1))` ; do  # 0..4, matching NR % PARALLELISM
  mkfifo infile.$N.fifo
  awk 'NR % '$PARALLELISM' == '$N' { print NF " " $0 }' infile |
    sort -k1,1nr >infile.$N.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo

Because we use modulo on the line number, we have good locality, and the filesystem cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes close to zero.

Even better: read the input file only once and round-robin its lines into several sort pipes:

PARALLELISM=5 # I want 5 parallel instances
for N in `seq 0 $((PARALLELISM - 1))` ; do  # 0..4, matching NR % PARALLELISM
  mkfifo infile.$N.fifo1
  mkfifo infile.$N.fifo2
  sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2&
done
awk '{ print NF " " $0 >("infile." NR % '$PARALLELISM' ".fifo1") }' infile&
sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo[12]

You should measure performance for various values of $PARALLELISM, then pick the optimal one.
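
One way to run that measurement, where sortjob.sh is a hypothetical wrapper that executes one of the scripts above with its first argument as the fan-out:

# Hypothetical benchmarking loop; sortjob.sh is assumed, not defined here.
for P in 2 4 8 16 24 ; do
  echo "== PARALLELISM=$P =="
  time ./sortjob.sh $P
done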

As shown in other posts, you can of course use cut instead of the final awk (i.e., the one that strips the first column) for potentially better efficiency. :)
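
For example, the final stage of any of the FIFO variants above could read as follows; note that, unlike the awk version, cut leaves no leading blank on each output line:

# Undecorate with cut: drop everything up to and including the first space.
sort -k1,1nr -m infile.*.fifo2 | cut -d' ' -f2- > outfile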

Updated all scripts for the filename convention you provided, and fixed a bug in the last version.

Also, using the new filename convention, if I/O is not the bottleneck then a very slight variation on dave/niry's solutions should probably be even more efficient:

   for FILE in infile.* ; do
     awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
       FILE=`basename $FILE` $FILE&
   done
   wait
   ls -1r tmpfile.* | xargs cat >outfile
   rm -f tmpfile.*
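
Why this needs no sort at all: each line is appended to a bucket file whose name embeds its zero-padded word count, so a plain reverse listing of the bucket names is already in descending order of clique size. For the five sample lines above (assuming a single input file named infile.1), the buckets would be:

$ ls -1r tmpfile.*
tmpfile.00005.infile.1    # the one 5-id line
tmpfile.00004.infile.1    # the one 4-id line
tmpfile.00003.infile.1    # the three 3-id lines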
