How to efficiently get a random 10% of the lines of a large file in Linux?

Problem description

I want to output a random 10% of the lines of a file. For instance, if file a has 1,000,000 lines, I want to output 100,000 random lines from it (100,000 being 10% of 1,000,000).

There is an easy way to do this, assuming the file is small:

randomLine=$(wc -l a | awk '{printf("%d\n",($1/10))}')
sort -R a | head -n "$randomLine"

But using sort -R is very slow, because it randomizes the whole file. My file has 10,000,000 lines, and sorting takes too much time. Is there any way to achieve a less thorough, not-so-random but efficient sampling?

Edit (ideas):

  1. Sampling one line out of every ten is acceptable, but I don't know how to do this with a shell script.
  2. Read the file line by line and, if

echo $RANDOM%100 | bc

is greater than 20, then output the line (using a number greater than 10 to ensure getting no less than 10% of the lines), and stop once 10% of the lines have been output. But I don't know how to read a file line by line using a shell script.
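The two ideas above can be sketched in bash roughly as follows. This is only an illustration: every_tenth and sample_percent are hypothetical helper names, and keeping a line when RANDOM % 100 is below the given percentage is an assumption about the intended test. Both awk (with its default record separator) and read split on \n only, so any \r characters stay inside the line.

```shell
#!/bin/bash
# Idea 1: keep every tenth line (deterministic, not random).
every_tenth() {
    awk 'NR % 10 == 0' "$@"
}

# Idea 2: read stdin line by line, keep each line with a probability of
# roughly percent/100, and stop after limit lines have been printed.
# The test (RANDOM % 100 < percent) is an assumption of this sketch.
sample_percent() {
    local percent=$1 limit=$2 count=0 line
    while IFS= read -r line; do
        if (( RANDOM % 100 < percent )); then
            printf '%s\n' "$line"
            count=$((count + 1))
            if (( count >= limit )); then
                break
            fi
        fi
    done
    return 0
}
```

For example, every_tenth a prints lines 10, 20, 30, and so on of file a, while sample_percent 20 100000 < a keeps about one line in five until 100,000 lines have been printed. A bash read loop is usually much slower than awk on a file with millions of lines, so the first form is likely the more practical one.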

Edit (clarification):

The reason I want to use a shell script is that my file contains \r characters. The newline character in the file should be \n, but the readline() functions in Python and Java treat both \r and \n as newline characters, which doesn't fit my needs.

Recommended answer

I think this is the best way:

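file="your file here"   # placeholder: replace with the path to your file
lines_in_file=$(wc -l < "$file")
lines_wanted=$((lines_in_file / 10))

# shuf picks lines_wanted distinct lines at random
shuf -n "$lines_wanted" "$file"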

Another creative solution:

echo $RANDOM generates a random number between 0 and 32767

Then, you can do:

echo $((RANDOM * 100000 / 32768 + 1))

.. to obtain a random number between 1 and 100000 (as nwellnhof points out in the comments below, this is not any number from 1 to 100000, but one of 32768 possible numbers between 1 and 100000, so it's kind of a projection...)
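A small sketch of that projection may make the point concrete. scale_random is a hypothetical helper, and the divisor 32768 is an assumption of this sketch: dividing by 32767 instead would let the largest RANDOM value map to n+1, just outside the range.

```shell
#!/bin/bash
# Map a RANDOM-style value r (0..32767) onto the range 1..n.
# The divisor 32768 is an assumption; it keeps r=32767 inside the range.
scale_random() {
    local r=$1 n=$2
    echo $(( r * n / 32768 + 1 ))
}

scale_random 0 100000        # smallest input  -> 1
scale_random 1 100000        # next input      -> 4 (2 and 3 are never produced)
scale_random 32767 100000    # largest input   -> 99997
```

Because only 32768 distinct inputs exist, most numbers between 1 and 100000 can never come out, which is the limitation described above.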

So:

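file="your file here"   # placeholder: replace with the path to your file
lines_in_file=$(wc -l < "$file")
lines_wanted=$((lines_in_file / 10))
for i in $(seq 1 "$lines_wanted")
do
    # RANDOM is 0..32767, so dividing by 32768 keeps the result within 1..lines_in_file
    line_chosen=$((RANDOM * lines_in_file / 32768 + 1))
    # print only line number line_chosen, then quit
    sed "${line_chosen}q;d" "$file"
done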
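The loop above starts sed once per wanted line and can pick the same line twice. As an alternative, not part of the original answer, a single awk pass can print exactly the wanted number of distinct lines using selection sampling: each line is kept with probability (lines still wanted) / (lines still unread). The sketch below assumes the file ends with a newline; sample_exact is a hypothetical helper name.

```shell
#!/bin/bash
# Print exactly k randomly chosen lines of a file in one pass, in file order.
# Usage: sample_exact FILE K   (sample_exact is a hypothetical helper name)
sample_exact() {
    local file=$1 k=$2 n
    n=$(wc -l < "$file")
    awk -v k="$k" -v n="$n" '
        BEGIN { srand() }   # seed from the current time
        # n - NR + 1 lines are still unread (including this one); k are still wanted.
        rand() * (n - NR + 1) < k { print; k-- }
    ' "$file"
}
```

With the variables from the answer, this would be called as sample_exact "$file" "$lines_wanted". The output keeps the original line order, so pipe it through shuf if a shuffled sample is needed. Since srand() seeds from the time of day, two runs within the same second may select the same lines.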
