Get random lines from large files in bash


Question

How can I get n random lines from a very large file that doesn't fit in memory?

It would also be great if I could add filters before or after the randomization.

The specs in my case are:


  • > 100 million lines

  • > 10 GB files

  • Usual random batch size of 10,000-30,000 lines

  • Hosted Ubuntu 14.10 server with 512 RAM

So losing a few lines from the file won't be a big problem, since each line has only about a 1 in 10,000 chance of being picked anyway, but performance and resource consumption would be a problem.

Recommended answer

Under such constraints, the following approach works better:


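  • Seek to a random position in the file (you will land somewhere inside a line)

  • From that position, go backward to find the start of that line

  • Go forward and print the full line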

For this you need a tool that can seek in files, for example Perl.

use strict;
use warnings;
use Fcntl qw( :seek O_RDONLY );

# Bytes read on each side of the random position,
# i.e. from "rand_position-256" up to "rand_position+256".
# Lines longer than this may be printed truncated or as an empty line.
my $seekdiff = 256;

my ($want, $filename) = @ARGV;
die "Usage: $0 wanted_count_of_lines file_name\n" unless defined $filename;

sysopen(my $fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!"); # file size in bytes

my $buffer;
my $cnt = 0;
while ($want > $cnt++) {
    my $randpos = int(rand($endpos));   # random file position
    my $seekpos = $randpos - $seekdiff; # start reading here ($seekdiff bytes before)
    $seekpos = 0 if ($seekpos < 0);

    sysseek($fd, $seekpos, SEEK_SET);   # seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1); # read 2*seekdiff bytes

    my $rand_in_buff = ($randpos - $seekpos) - 1; # the random position within the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; # find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            # find the end of the line in the buffer
    # if the end of the line is not in the buffer, an empty line is printed
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart;

    print "$the_line\n";
}

Save the above to a file such as "randlines.pl" and run it as:

perl randlines.pl wanted_count_of_lines file_name

For example:

perl randlines.pl 10000 ./BIGFILE
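
The question also asks about filtering after the randomization. Because the script picks positions independently, the same line can be selected more than once, and it prints an empty line when a line's end falls outside the buffer it read (the `$lineend < 0` case). One possible way to handle both is to request somewhat more lines than needed and clean up the output in a pipeline. The sketch below assumes a POSIX shell with `awk` and `head`; the function name `clean_sample` is hypothetical and not part of the original answer.

```shell
# Hypothetical helper: read sampled lines on stdin, drop blank lines and
# repeated lines (keeping the first occurrence, so the random order is
# preserved), and stop after the first $1 lines that remain.
clean_sample() {
    want=$1
    awk 'NF && !seen[$0]++' | head -n "$want"
}
```

It could then be used as `perl randlines.pl 12000 ./BIGFILE | clean_sample 10000`, where the oversampling margin of 12000 is only an illustrative guess. Any other filter (for example `grep`) can be placed in the same pipeline. Filtering before the randomization does not combine as easily with the seek approach, since it requires a full pass over the file.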

The script uses very low-level I/O operations, so it is very fast. (On my notebook, selecting 30k lines from 10M took half a second.)
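
For comparison, the same seek-based idea can be approximated with standard command-line tools alone. This is a rough sketch rather than a replacement for the Perl script: it assumes GNU coreutils (`shuf`, and a `tail` that can seek on regular files) plus `sed`, and the function name `rand_lines` is hypothetical. Instead of searching backward for the line start, it skips the partial line at the random offset and prints the next complete line, so the first line of the file is never chosen and an offset that lands in the last line yields no output.

```shell
# Hypothetical helper: print up to $1 random lines from file $2 by
# jumping to random byte offsets instead of reading the whole file.
rand_lines() {
    n=$1
    file=$2
    size=$(( $(wc -c < "$file") ))      # file size in bytes
    # pick n distinct random byte offsets in 1..size
    shuf -i 1-"$size" -n "$n" | while read -r off; do
        # start output at byte $off, discard the (possibly partial)
        # first line, print the next full line, then stop reading
        tail -c +"$off" "$file" | sed -n '2{p;q;}'
    done
}
```

A call such as `rand_lines 10000 ./BIGFILE` may return slightly fewer than 10000 lines and can contain duplicates. Because it starts several processes per sampled line, it is likely to be considerably slower than the Perl script, although its memory use stays small.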
