Sampling without replacement using awk

Question

I have a lot of text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT

Is there a way to sample without replacement using awk?

For example, I have these 8 lines, and I want to randomly sample only 4 of them into a new file, without replacement. The output should look something like this:

>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT

Thanks in advance.

Recommended answer

How about this for a random sampling of 10% of your lines?

awk 'rand()>0.9' yourfile1 yourfile2 anotherfile

I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.

Basically, it looks at each line of each file exactly once and generates a random number in the interval from 0 to 1. If the random number is greater than 0.9, the line is output. In effect, it rolls a 10-sided die for each line and prints the line only if the die comes up 10. Because every line is an independent draw, the output is roughly 10% of the input rather than an exact count. There is no chance of a line being printed twice, unless it occurs twice in your files, of course.
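As a rough sketch, the same per-line test can be written with an adjustable sampling rate. The helper name `sample_fraction` and the variable `p` are made up for illustration and are not part of the original answer. The test `rand() < p` selects each line with the same probability as `rand() > 1 - p` would, and `srand()` is called so that separate runs do not repeat the same choices.

```shell
# sample_fraction P [FILE...]: print each input line independently with
# probability P. The function name and the variable p are hypothetical.
sample_fraction() {
    p=$1
    shift
    awk -v p="$p" '
        BEGIN { srand() }   # seed from the clock so runs differ
        rand() < p          # rand() is in [0, 1), so this is true with probability p
    ' "$@"
}
```

For example, `sample_fraction 0.1 yourfile1 yourfile2` behaves like the one-liner above. The number of lines printed still varies from run to run, since each line is decided on its own.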

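For added randomness (!), you can add an srand() at the start, as suggested by @klashxx. Without it, many awk implementations (gawk, for example) start rand() from the same seed on every run, so the same lines would be selected each time.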

awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
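The filter above returns about 10% of the lines, not a fixed number. If exactly 4 of the 8 lines are wanted, as in the question, one possible approach is reservoir sampling, a different technique from the one in the answer. The sketch below is only an illustration under that assumption; the helper name `sample_k` and the array name `pool` are invented here.

```shell
# sample_k K [FILE...]: print K lines chosen uniformly at random, without
# replacement, using reservoir sampling. Names here are hypothetical.
sample_k() {
    k=$1
    shift
    awk -v k="$k" '
        BEGIN { srand() }               # seed from the clock
        # The first k lines fill the reservoir.
        NR <= k { pool[NR] = $0; next }
        # Each later line picks a slot in 1..NR and is kept only if
        # that slot falls inside the reservoir (probability k/NR).
        {
            i = int(rand() * NR) + 1
            if (i <= k) pool[i] = $0
        }
        END {
            m = (NR < k) ? NR : k       # fewer than k lines: print them all
            for (j = 1; j <= m; j++) print pool[j]
        }
    ' "$@"
}
```

Usage would look like `sample_k 4 yourfile > sample.txt`. Each input line can occupy at most one slot, so no line is printed twice unless it appears twice in the input. Since `srand()` seeds from the time of day, two calls within the same second may return the same sample.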
