Alternative for command line "sort | uniq -c | sort -n"


Question

I have used sort | uniq -c | sort -n for years, but today it failed because my input file is 10 GB and my /tmp only holds 1 GB:

sort: write failed: /tmp/sortmIGbL: No space left on device

Therefore I am looking for an efficient alternative for everyday use:

  • awk may be used, but it has no sorted associative array
  • perl seems to be a good option, but the 10-year-old solution from perlmonks.org does not seem to work

no warnings;
$^W=0;
open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!";
my ($buffer, %h) = ''; keys %h = 1024*500;
while (sysread($in, $buffer, 16384, length $buffer)) {
    $h{$1}++ while $buffer =~ m[^(?:.+?\|){9}([^|]+)\|]mg;
    $buffer = substr($buffer, rindex($buffer, "\n"));
}
print scalar keys %h;

  • As I use Linux/Cygwin/Solaris/*BSD/..., I am open to any idea (portable or not)
  • You are free to use whichever scripting language you want (awk/perl/...)

Example input

a
BB
ccccc
dddddddd
a
BB
a

One of the possible outputs

    3 a
    2 BB
    1 dddddddd
    1 ccccc

Recommended answer

The first sort in your chain of commands is the one using all the resources. Reduce the problem set by getting the unique lines first, then sorting:

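perl -ne '
    $count{$_}++;    # tally each distinct line (the trailing newline is part of the key)
    END {
        # highest count first; ties in reverse string order
        print "$count{$_} $_" for sort {
            $count{$b} <=> $count{$a} || $b cmp $a
        } keys %count
    }
' input.txt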

You have 66,000 unique lines of 7 bytes each, so the memory taken up by the hash keys is going to be 66,000 * 56 bytes for each of those scalars = 3,696,000 bytes for the keys. That doesn't include the counts and the overhead of the hash, but there's no doubt this approach will easily do the trick.
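The same count-first idea can also be sketched with awk, which the question mentions. awk arrays are unordered, so in this sketch the tally is piped through sort, which then only has to handle one line per distinct value instead of the whole 10 GB input. The function name count_lines_awk is hypothetical, and the approach assumes, as above, that the set of distinct lines fits in memory.

```shell
#!/bin/sh
# count_lines_awk is a hypothetical helper name used only for this sketch.
count_lines_awk() {
    # Tally every distinct line in an in-memory array, emit "count line" pairs,
    # then let sort order the much smaller tally numerically, largest first.
    awk '{ count[$0]++ } END { for (line in count) print count[line], line }' "$@" |
        sort -rn
}

# Sample input from the question; a file name could be passed as an argument instead.
printf '%s\n' a BB ccccc dddddddd a BB a | count_lines_awk
```

The order of lines that share the same count depends on how sort breaks ties, so it may differ from the Perl version.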
