Alternative for command line "sort | uniq -c | sort -n"
Question
I have used sort | uniq -c | sort -n for years, but today it failed because my input file is 10 GB and my /tmp is only 1 GB:
sort: write failed: /tmp/sortmIGbL: No space left on device
Therefore I am looking for an efficient alternative for everyday use:
- awk may be used, but it has no sorted associative array
- perl seems to be a good option, but the 10-year-old solution from perlmonks.org does not seem to work
no warnings;
$^W=0;
open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!";
my ($buffer, %h) = ''; keys %h = 1024*500;
while (sysread($in, $buffer, 16384, length $buffer)) {
$h{$1}++ while $buffer =~ m[^(?:.+?\|){9}([^|]+)\|]mg;
$buffer = substr($buffer, rindex($buffer, "\n"));
}
print scalar keys %h;
- As I use Linux/Cygwin/Solaris/*BSD/..., I am open to any idea (portable or not)
- You are free to use the scripting language you want (awk/perl/...)
Example input:
a
BB
ccccc
dddddddd
a
BB
a
One of the possible outputs:
3 a
2 BB
1 dddddddd
1 ccccc
Recommended answer
The first sort in your chain of commands is the one using all the resources. Reduce the problem set by counting the unique lines first, then sorting only those:
perl -ne '
    $count{$_}++;
    END {
        print "$count{$_} $_" for sort {
            $count{$b} <=> $count{$a} || $b cmp $a
        } keys %count
    }
' input.txt
You have 66,000 unique lines of 7 bytes, so the memory taken up by the hash keys is going to be 66,000 * 56 bytes for each of those scalars = 3,696,000 bytes for the keys. That doesn't include the counts and the overhead of the hash, but there's no doubt this approach will easily do the trick.
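Since the question also mentions awk, the same count-first idea can be sketched there as well. This is only a sketch, not part of the original answer: the function name count_lines is hypothetical, and it assumes the unique lines fit in memory, as estimated above. awk does the counting in an unsorted associative array, and sort then only has to order the unique lines, so it should not need much temporary space.

```shell
# count_lines FILE
# Hypothetical helper (name chosen for illustration only).
# Prints "COUNT LINE" for every distinct line, most frequent first.
count_lines() {
    # awk keeps one counter per distinct line in memory, so the
    # 10 GB input is read once and never sorted as a whole.
    awk '{ count[$0]++ } END { for (line in count) print count[line], line }' "$1" |
        # Only the unique lines reach sort: count descending,
        # then the line text as a tie-breaker (byte order via LC_ALL=C).
        LC_ALL=C sort -k1,1nr -k2
}
```

Called as count_lines input.txt, this should list the most frequent line first. Drop the r in -k1,1nr to get the ascending order that sort -n produces.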