How do I read N random lines out of a file without storing the file in memory?

I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?

The use case is for a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.
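
For reference, the array-based version being replaced looks roughly like this (a minimal sketch, not the actual generator):

#!/usr/bin/perl -lw

# Slurp the whole dictionary into memory, then pick random elements.
my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;
chomp(my @dict = <$fh>);
print join ".", map { $dict[rand @dict] } 1..$Num_Words;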

(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)

Update: Thank you all for your answers.

Answers fell into three categories: modifications of the full-read algorithm, random seeking, or indexing the lines and seeking to them randomly.

The random seek is much faster, and constant with respect to file size, but it distributes based on file size rather than number of words. It also allows duplicates (which can be avoided, but that makes the algorithm O(inf)). Here's my reimplementation of my password generator using that algorithm. I realize that by reading forward from the seek point, rather than backward, it has an off-by-one error should the seek fall in the last line. Correcting it is left as an exercise for the editor.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

# File size in bytes; random seek targets are drawn from this range.
my $size = -s $Words;

my @words;
open my $fh, "<", $Words or die $!;

for(1..$Num_Words) {
    # Jump to a random byte offset in the file.
    seek $fh, int rand $size, 0 or die $!;
    # Discard the rest of the (probably partial) line we landed in.
    <$fh>;
    # The next full line is the candidate word.
    my $word = <$fh>;
    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}
print join ".", @words;
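
Taking that exercise literally, one possible fix, as a sketch (not from the original post): if the seek lands in the last line, the read after the discarded partial line returns undef, so simply retry with a fresh seek. Note that this scheme still can never pick the first line of the file, since the leading discard always skips past it.

for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;
    <$fh>;                      # discard the partial line we landed in
    my $word = <$fh>;
    redo unless defined $word;  # seek fell in the last line; try again
    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}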

And then there's Guffa's answer, which is what I was looking for: an extension of the original algorithm. It's slower, since it has to read the whole file, but it distributes by word, allows filtering without changing the algorithm's efficiency, and (I think) produces no duplicates.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my @words;
open my $fh, "<", $Words or die $!;
my $count = 0;
while(my $line = <$fh>) {
    chomp $line;
    $count++;
    if( $count <= $Num_Words ) {
        # Fill the reservoir with the first N lines.
        $words[$count-1] = $line;
    }
    elsif( rand($count) <= $Num_Words ) {
        # Each later line enters the reservoir with probability N/count,
        # evicting a random current occupant.
        $words[rand($Num_Words)] = $line;
    }
}

print join ".", @words;

Finally, the index-and-seek algorithm has the advantage of distributing by word rather than by file size. The disadvantage is that it reads the whole file, and memory usage scales linearly with the number of words in the file. Might as well use Guffa's algorithm.
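
For completeness, an index-and-seek implementation might look something like this sketch (not code from any of the answers):

#!/usr/bin/perl -lw

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;

# Index pass: record the byte offset at which each line starts.
my @offsets = (0);
push @offsets, tell $fh while <$fh>;
pop @offsets;   # the final tell is EOF, not the start of a line

# Selection pass: seek straight to randomly chosen line starts.
my @words;
for (1..$Num_Words) {
    seek $fh, $offsets[rand @offsets], 0 or die $!;
    my $word = <$fh>;
    chomp $word;
    push @words, $word;
}
print join ".", @words;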

Solution

The algorithm is not implemented in a very clear way in that example... Some pseudocode that explains it better:

cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if random(1 to cnt) = 1 {
      result = line
   }
}

As you see, the idea is that you read each line in the file and calculate the probability that the line should be the one chosen. After reading the first line the probability is 100%, after reading the second line the probability is 50%, and so on.
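
In Perl, the language of the question, that single-line pick translates to roughly this sketch:

#!/usr/bin/perl -lw

open my $fh, "<", "/usr/share/dict/words" or die $!;

my ($result, $cnt);
while (my $line = <$fh>) {
    $cnt++;
    # Replace the current pick with probability 1/cnt.
    $result = $line if int(rand $cnt) == 0;
}
chomp $result;
print $result;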

This can be expanded to picking N items by keeping an array of size N instead of a single variable, and calculating the probability that a line replaces one of the current ones in the array:

var result[1..N]
cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if cnt <= N {
      result[cnt] = line
   } else if random(1 to cnt) <= N {
      result[random(1 to N)] = line
   }
}

Edit:
Here's the code implemented in C#:

public static List<string> GetRandomLines(string path, int count) {
    List<string> result = new List<string>();
    Random rnd = new Random();
    int cnt = 0;
    string line;
    using (StreamReader reader = new StreamReader(path)) {
        while ((line = reader.ReadLine()) != null) {
            cnt++;
            int pos = rnd.Next(cnt);
            if (cnt <= count) {
                // Insert the first N lines at random positions so the
                // initial items don't stay in file order.
                result.Insert(pos, line);
            } else {
                // pos < count happens with probability count/cnt, the
                // chance this line should enter the reservoir.
                if (pos < count) {
                    result[pos] = line;
                }
            }
        }
    }
    return result;
}

I made a test by running the method 100000 times, picking 5 lines out of 20, and counted the occurrences of the lines. This is the result:

25105
24966
24808
24966
25279
24824
25068
24901
25145
24895
25087
25272
24971
24775
25024
25180
25027
25000
24900
24807

As you see, the distribution is as good as you could ever want. :)

(I moved the creation of the Random object out of the method when running the test, to avoid seeding problems as the seed is taken from the system clock.)

Note:
You might want to scramble the order of the resulting array if you want the lines to be randomly ordered. Since the first N lines are placed into the array in order, they are not randomly placed if they survive to the end. For example, if N is three or larger and the third line is picked, it will always be at the third position in the array.
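
(In the Perl versions above, the core List::Util module's shuffle does that scrambling in one line:)

use List::Util 'shuffle';
@words = shuffle @words;   # random order regardless of how the slots were filled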

Edit 2:
I changed the code to use a List<string> instead of a string[]. That makes it easy to insert the first N items in a random order. I updated the test data from a new test run, so that you can see that the distribution is still good.
