How to select random lines from a file

Question

I have a text file containing 10 hundreds of lines, with different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.

I've found some answers to this question, but most of them use a simple idea: sort the file and select the first or last N lines. Unfortunately, this idea doesn't work for me, because I want to preserve the order of the lines.

I tried this piece of code, but it's very slow and takes hours.

FILEsrc=$1    # source file
FILEtrg=$2    # target file
MaxLines=$3   # number of lines to move
LineIndex=1
while [ "$LineIndex" -le "$MaxLines" ]
do
    # count the lines left in the source file
    NUM=$(wc -l < "$FILEsrc")
    # pick a random line number in 1..NUM (note: $RANDOM is at most 32767)
    X=$(( RANDOM % NUM + 1 ))
    echo "$X"
    sed -n "${X}p" "$FILEsrc" >> "$FILEtrg"   # write selected line into target file
    sed -i -e "${X}d" "$FILEsrc"              # remove selected line from source file
    LineIndex=$(( LineIndex + 1 ))
done

I found that this line is the most time-consuming one in the code:

sed -i -e  ${X}d ${FILEsrc};

Is there any way to overcome this problem and make the code faster? Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this?

Recommended answer

A simple O(n) algorithm is described in:

http://en.wikipedia.org/wiki/Reservoir_sampling

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done
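
As a minimal sketch of how the pseudocode above could be applied to the question (in Python rather than the C/C++ that was asked for, so treat it as an illustration only): sampling line indices instead of the lines themselves makes it possible to keep the original order when writing both files. The names `reservoir_sample_indices` and `move_random_lines` are made up for this example, and the sketch assumes the whole file fits in memory.

```python
import random


def reservoir_sample_indices(n_items, k, rng=random):
    """Pick k distinct indices from range(n_items) by reservoir sampling.

    Follows the pseudocode above, but with 0-based indices.
    """
    k = max(k, 0)
    if k >= n_items:
        return list(range(n_items))
    reservoir = list(range(k))        # fill the reservoir
    for i in range(k, n_items):
        j = rng.randint(0, i)         # inclusive on both ends
        if j < k:
            reservoir[j] = i          # replace with decreasing probability
    return reservoir


def move_random_lines(src_path, trg_path, k, rng=random):
    """Move k random lines from src_path to trg_path, keeping line order.

    Hypothetical helper: assumes the source file fits in memory.
    Returns the number of lines moved.
    """
    with open(src_path, "r") as src:
        lines = src.readlines()

    chosen = set(reservoir_sample_indices(len(lines), k, rng))

    # Walking the lines in file order preserves the order in both outputs.
    selected = [line for idx, line in enumerate(lines) if idx in chosen]
    remaining = [line for idx, line in enumerate(lines) if idx not in chosen]

    with open(trg_path, "a") as trg:  # append, like >> in the shell script
        trg.writelines(selected)
    with open(src_path, "w") as src:  # rewrite the source file once
        src.writelines(remaining)
    return len(selected)
```

Called as `move_random_lines("source.txt", "target.txt", 100)` (the file names are placeholders), this reads and rewrites the source file once, instead of once per selected line as the `sed -i` loop does.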
