Pull random unique samples within sorted categories in bash
Question
I have a large unsorted CSV file (>4M records). Each record has a category, which is given by the first three columns. The rest of the record is address data, which may or may not be unique.
A, 1, c, address1 # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3 # the category for this record is C3e
B, 2, a, address4
I would like to pull a random sample of unique records within each category (so 5 unique records from category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort. However, it only pulls one non-random record from each category:
sort -u -t, -k1,3
Is there a way to pull several random sample records within each category?
I think there must be a way to do this with a combination of pipes, uniq, awk or shuf, but I haven't been able to figure it out. I would prefer a command-line solution, since I'm interested in knowing whether this is possible using only bash.
Recommended answer
Inspired by the use of sort -R in the answer by jm666. This is a GNU extension to sort, so it may not work on non-GNU systems.
Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order, with the remaining fields in random order within each category.
From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.
sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
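To see the pipeline above at work, here is a small sketch that runs it on a throwaway file. The file contents are made up for illustration, and the script assumes GNU sort (for the R key option) and mktemp are available.

```shell
#!/usr/bin/env bash
# Build a small throwaway CSV (hypothetical data): three categories, one duplicate.
csvfile=$(mktemp)
{
  for i in 1 2 3 4 5 6 7 8; do echo "A,1,c,address$i"; done
  echo "A,1,c,address1"            # exact duplicate, dropped by sort -u
  for i in 1 2 3; do echo "C,3,e,address$i"; done
  echo "B,2,a,address9"
} > "$csvfile"

# Sort by category, randomise within each category, keep at most 5 lines per category.
sample=$(sort -ut, -k1,3 -k4R "$csvfile" |
         awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5')
printf '%s\n' "$sample"

# Summarise how many lines were kept for each category.
counts=$(printf '%s\n' "$sample" | cut -d, -f1-3 | uniq -c |
         awk '{print $2 "=" $1}' | paste -sd' ' -)
echo "$counts"
rm -f "$csvfile"
```

Category A1c has eight distinct records here, so five of them are kept and the choice should vary between runs; the smaller categories are returned in full.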
If your sort doesn't randomise, then the random sample can be extracted with awk:
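# Warning! Only slightly tested :)
sort -ut, "$csvfile" | awk -F, '
# Seed rand(); without this, some awks (e.g. gawk) repeat the same sample every run.
BEGIN { srand() }
# Print a random sample of at most 5 of the n buffered lines, leaving n == 0.
function sample() {
  # Overwrite a random slot with the last line until only 5 lines remain.
  for (; n > 5; --n) v[int(n * rand()) + 1] = v[n]
  for (; n; --n) print v[n]
}
a != $1$2$3 { a = $1$2$3; sample() }  # category changed: flush the previous one
{ v[++n] = $0 }                        # buffer the current line
END { sample() }                       # flush the last category
'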
It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
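For comparison, a minimal sketch of that awk-only variant might look like the following. The helper name sample_awk_only and its optional sample-size argument are hypothetical and not part of the original answer. It holds every distinct line in memory, which is exactly the cost described above.

```shell
#!/usr/bin/env bash
# sample_awk_only FILE [K]: hypothetical helper, not part of the original answer.
# Prints up to K (default 5) random distinct lines per category, without using sort.
sample_awk_only() {
  awk -F, -v k="${2:-5}" '
    BEGIN { srand() }
    !seen[$0]++ {                       # keep only the first copy of each line
      cat = $1 FS $2 FS $3              # category = first three fields
      rec[cat, ++cnt[cat]] = $0
    }
    END {
      for (cat in cnt) {                # category order is unspecified
        n = cnt[cat]
        # Drop random lines until k remain, then print what is left.
        for (; n > k; --n) rec[cat, int(n * rand()) + 1] = rec[cat, n]
        for (; n; --n) print rec[cat, n]
      }
    }' "$1"
}

# Example run on a throwaway file (made-up data).
demo=$(mktemp)
{ for i in 1 2 3 4 5 6 7 1; do echo "A,1,c,addr$i"; done; echo "B,2,a,addr8"; } > "$demo"
sample_awk_only "$demo" 3
rm -f "$demo"
```

Unlike the sort-based pipeline, the categories come out in whatever order awk iterates its array, so pipe the result through sort if grouped, ordered output matters.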