Randomly selecting lines from files

Problem description

I have a bunch of files, and every file has a 5-line header. In the rest of the file, each pair of lines forms an entry. I need to randomly select entries from these files. How can I select a random file and a random entry (a pair of lines, excluding the header)?

Solution

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
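For the small-file case, a minimal sketch of the in-memory approach might look like this. The sub name `random_entry_in_memory` and its `$header_lines` parameter are illustrative and not part of the original answer; the sketch assumes every entry is exactly two lines and silently ignores a trailing unpaired line.

```perl
use strict;
use warnings;

# Hypothetical helper: read every entry from an open filehandle into memory
# and return one at random. Assumes $header_lines header lines followed by
# two-line entries.
sub random_entry_in_memory {
    my ($fh, $header_lines) = @_;

    # Skip the header.
    for (1 .. $header_lines) {
        my $discard = <$fh>;
    }

    # Collect entries as array references of two lines each.
    my @entries;
    while (defined(my $first = <$fh>)) {
        my $second = <$fh>;
        last unless defined $second;    # ignore a trailing unpaired line
        push @entries, [ $first, $second ];
    }

    return unless @entries;             # empty list if there are no entries
    return @{ $entries[ rand @entries ] };
}
```

After opening a file with `open my $fh, '<', $path or die $!;`, calling `print random_entry_in_memory($fh, 5);` would print one randomly chosen pair of lines.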

Here's an intuitive explanation for the algorithm.

Process the file line by line.
pick = line, with probability 1/N, where N = line number

In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.
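The rule above can be sketched for single lines as follows. The sub name `random_line` is illustrative; the sketch keeps only the current pick in memory, so it works regardless of file size.

```perl
use strict;
use warnings;

# Illustrative helper: return one line chosen uniformly at random from a
# filehandle, holding only the current pick in memory.
sub random_line {
    my ($fh) = @_;
    my ($pick, $n);
    while (defined(my $line = <$fh>)) {
        $n++;
        # Line number $n replaces the current pick with probability 1/$n.
        $pick = $line if rand($n) < 1;
    }
    return $pick;    # undef if the input was empty
}
```

On the first line `rand(1) < 1` is always true, so the first line is always picked initially, matching the 1/1 case described above.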

For an intuitive proof, imagine a file with 3 lines:

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

The probability for each outcome:

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).
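The inductive step can also be written out in general form. This is a sketch of the standard argument, using $N$ for the number of lines processed so far, as in the explanation above:

```latex
% Claim: after N lines, each line i <= N is the current pick with probability 1/N.
% Inductive step for line N+1:
\begin{align*}
P(\text{pick} = N+1) &= \frac{1}{N+1}, \\
P(\text{pick} = i) &= \underbrace{\frac{1}{N}}_{\text{picked so far}}
  \cdot \underbrace{\left(1 - \frac{1}{N+1}\right)}_{\text{not replaced}}
  = \frac{1}{N} \cdot \frac{N}{N+1} = \frac{1}{N+1}
  \quad \text{for } i \le N.
\end{align*}
```

With $N = 3$ this reproduces the 1/3 * 3/4 = 1/4 calculation for a 4-line file.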

Here's the Perl FAQ answer, adapted to your problem.

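use strict;
use warnings;

# Skip the 5-line header.
<> for 1 .. 5;

# Use reservoir sampling to select one pair from the remaining lines.
my (@picks, $n);
until (eof) {
    # Read the two lines that make up one entry.
    my @lines;
    $lines[$_] = <> for 0 .. 1;

    # Entry number $n replaces the current pick with probability 1/$n.
    $n++;
    @picks = @lines if rand($n) < 1;
}

print @picks;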
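The question also mentions several files, each with its own header. One possible extension is sketched below; it is not from the original answer. The sub name `random_entry_from_files` is hypothetical. The sketch assumes that "random" means uniform over all entries in all files. That is not the same as first picking a file uniformly and then picking an entry from it, because files with more entries are proportionally more likely to supply the pick.

```perl
use strict;
use warnings;

# Illustrative helper: choose one two-line entry uniformly at random from
# all entries in the given files, skipping $header_lines lines per file.
sub random_entry_from_files {
    my ($header_lines, @paths) = @_;
    my (@pick, $n);

    for my $path (@paths) {
        open my $fh, '<', $path or die "Cannot open $path: $!";

        # Skip this file's header.
        for (1 .. $header_lines) {
            my $discard = <$fh>;
        }

        while (defined(my $first = <$fh>)) {
            my $second = <$fh>;
            last unless defined $second;    # ignore a trailing unpaired line

            # $n counts entries across all files, so every entry ends up
            # with the same 1/$n chance of being the final pick.
            $n++;
            @pick = ($first, $second) if rand($n) < 1;
        }
        close $fh;
    }
    return @pick;    # empty list if no entries were found
}
```

Calling `print random_entry_from_files(5, @ARGV);` would print one pair of lines drawn from the files named on the command line.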
