Pull random unique samples within sorted categories in bash


Problem description

I have a large unsorted CSV file (>4M records). Each record has a category, which is described by the first three columns. The rest of the record is address data, which may or may not be unique.

A, 1, c, address1  # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3  # the category for this record is C3e
B, 2, a, address4

I would like to pull a random sample of unique records within each category (so 5 unique records from category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort. However, it only pulls one non-random record from each category:

sort -u -t, -k1,3

Is there a way to pull several random sample records within each category?

I think there must be a way to do this with some combination of pipes, uniq, awk, or shuf, but I haven't been able to figure it out. I would prefer a command-line solution, since I'm interested in knowing whether this is possible using only bash.

Recommended answer

Inspired by the use of sort -R in the answer by jm666. This is a GNU extension to sort, so it may not work on non-GNU systems.

Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order, with the records inside each category in random order. The -u flag drops records that compare equal on every key, so each surviving record is unique.

From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.

sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
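
As a quick check, the pipeline above can be tried on a small made-up file. This is only a sketch: the sample rows, the temporary file, and the limit of 2 records per category (instead of 5) are assumptions for illustration, and it needs GNU sort for the R key modifier.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real file has millions of rows.
csvfile=$(mktemp)
cat > "$csvfile" <<'EOF'
A,1,c,address1
A,1,c,address2
A,1,c,address2
A,1,c,address5
C,3,e,address3
B,2,a,address4
EOF

# Same pipeline as above, keeping at most 2 records per category instead of 5.
# Requires GNU sort: the R modifier on -k4 orders that key by a random hash.
result=$(sort -ut, -k1,3 -k4R "$csvfile" |
  awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=2')

printf '%s\n' "$result"
rm -f "$csvfile"
```

With this input, category A has three distinct records, so two of them are printed (which two varies between runs), followed by the single records for B and C.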

If your sort doesn't randomise, then the random sample can be extracted with awk:

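# Warning! Only slightly tested :)
sort -ut, "$csvfile" | awk -F, '
      BEGIN { srand() }   # seed rand(); without this, some awks repeat the same sample on every run
      # Print up to 5 random lines from v[1..n], leaving n at 0.
      function sample(){
        # While more than 5 remain, drop one at random by overwriting a random slot with the last entry.
        for(;n>5;--n)v[int(n*rand())+1]=v[n];
        for(;n;--n)print v[n]
      }
      # New category: emit the sample for the previous one.
      a!=$1$2$3{a=$1$2$3;sample()}
      {v[++n]=$0}
      END      {sample()}'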

It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
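
For comparison, a sort-free version might look roughly like the sketch below. It is one guess at what "keeping all the entries in awk" could look like, not the answerer's code: it remembers every distinct line in an array (hence the memory cost) and keeps a per-category reservoir of at most k lines. The function name sample_per_category and the names seen, cnt, res, and k are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: per-category reservoir sampling in awk, with no sort step.
# Memory grows with the number of distinct lines because of the seen[] array.
# Usage: sample_per_category FILE [K]   (K defaults to 5)
sample_per_category() {
  awk -F, -v k="${2:-5}" '
    BEGIN { srand() }
    seen[$0]++ { next }                 # skip exact duplicate records
    {
      cat = $1 FS $2 FS $3              # category key from the first three fields
      c = ++cnt[cat]                    # distinct records seen so far in this category
      if (c <= k) {
        res[cat, c] = $0                # fill the reservoir first
      } else {
        j = int(c * rand()) + 1         # uniform integer in 1..c
        if (j <= k) res[cat, j] = $0    # replace a slot with probability k/c
      }
    }
    END {
      for (cat in cnt) {
        m = (cnt[cat] < k) ? cnt[cat] : k
        for (i = 1; i <= m; i++) print res[cat, i]
      }
    }' "$1"
}
```

It could be called as sample_per_category "$csvfile" 5. Categories come out in no particular order, so the output would need to be piped through sort if category order matters.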
