Get random lines from large files in bash

Views: 168 Published: 2016/8/3 11:13:22 Tags: bash command-line random-sample line-processing


Question

How can I get n random lines from very large files that don't fit in memory?

It would also be great if I could add filters before or after the randomization.

The specs in my case are:

> 100 million lines

> 10GB files

usual random batch size 10000-30000

512RAM hosted Ubuntu Server 14.10

So losing a few lines from the file won't be a big problem, since each line has only a 1 in 10000 chance of being picked anyway, but performance and resource consumption would be a problem.

Recommended answer

Under such constraints, the following approach works better:

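seek to a random position in the file (you will land somewhere "inside" a line)

go backward from that position and find the start of that line

go forward and print the full line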

For this you need a tool that can seek in files, for example perl.

use strict;
use warnings;
use Fcntl qw( :seek O_RDONLY );

# Bytes to read on each side of the random position,
# i.e. from rand_position-256 up to rand_position+256.
# Lines that do not fit in this window may come out truncated or empty.
my $seekdiff = 256;

my ($want, $filename) = @ARGV;
die "Usage: $0 wanted_count_of_lines file_name\n" unless defined $filename;

sysopen(my $fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!");

my $buffer;
my $cnt = 0;
while ($want > $cnt++) {
    my $randpos = int(rand($endpos));   # random file position
    my $seekpos = $randpos - $seekdiff; # start reading here ($seekdiff bytes before)
    $seekpos = 0 if ($seekpos < 0);

    sysseek($fd, $seekpos, SEEK_SET);   # seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1); # read 2*$seekdiff bytes
    die("Can't read $filename: $!") unless defined $in_count;

    my $rand_in_buff = ($randpos - $seekpos) - 1; # the random position in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; # find the beginning of the line in the buffer
    my $lineend   = index($buffer, "\n", $linestart);         # find the end of the line in the buffer
    my $the_line  = substr($buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart);

    print "$the_line\n";
}

Save the above into a file such as "randlines.pl" and run it as:

perl randlines.pl wanted_count_of_lines file_name

For example:

perl randlines.pl 10000 ./BIGFILE

The script does very low-level IO operations, i.e. it is VERY FAST. (On my notebook, selecting 30k lines from 10M took half a second.)
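For comparison, the same three steps (seek to a random offset, go back to the line start, read forward) can be sketched in Python. This is only an illustration under stated assumptions, not the answer's own code: the names `line_at`, `sample_lines`, `chunk` and `keep` are made up here, and instead of the fixed 512-byte window it scans backward in chunks until it finds a newline, so long lines are not truncated. The optional `keep` predicate is one way to filter after the randomization; it can leave fewer than n lines.

```python
import os
import random


def line_at(f, pos, chunk=4096):
    """Return the full line (without its newline) containing byte offset pos."""
    # Step 2: walk backward in chunks until a newline before pos is found.
    start = pos
    while start > 0:
        step = min(chunk, start)
        f.seek(start - step)
        block = f.read(step)
        nl = block.rfind(b"\n")
        if nl >= 0:
            start = start - step + nl + 1  # line begins right after that newline
            break
        start -= step  # no newline in this chunk, keep going back
    # Step 3: read forward from the line start to the next newline.
    f.seek(start)
    return f.readline().rstrip(b"\n")


def sample_lines(path, n, keep=None, rng=random):
    """Make n random draws and yield the drawn lines that pass keep (if given)."""
    size = os.path.getsize(path)
    if size == 0:
        return
    with open(path, "rb") as f:
        for _ in range(n):
            pos = rng.randrange(size)  # Step 1: uniformly random byte offset
            line = line_at(f, pos)
            if keep is None or keep(line):
                yield line
```

It could be used as `for line in sample_lines("./BIGFILE", 10000): ...`. As with the Perl script, only a small part of the file is read per draw, so memory use stays small. Note that picking a random byte offset favors longer lines (a line is chosen in proportion to its length) and the same line can be drawn more than once, so the result is an approximate sample rather than a uniform one.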
