Shell: select one unique row per ID from a flat file


Question

I have a flat file that looks like this:

cat file

ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
ID3...

As you can see from the data sample, each ID has several values, and those values could be anything, identical or not. I don't care which value gets picked; any value works for me.

So I want only one value for each ID. I don't really care which one, but if I have to choose, I would take the longest row.

ID1, VALUE1_2
ID2, VALUE2_1
ID3, VALUE3_1

This could be done in Python, but is there an easy way to do it in the shell itself? I am open to using sed or awk, but please don't write a whole paragraph of awk code.

It might look something like this:

# Pseudo code
# sort -k 1 file | uniq (max(length) by id)  

Thanks a lot!

Recommended answer

This will find the first line for each ID:

awk -F, '!seen[$1]++' file

Explanation:


  • awk associative arrays do not have to be pre-declared, so the first time an ID is encountered, seen[$1] has the value zero (in a numeric context).
  • seen[$1]++ post-increments the associative array element, so the expression evaluates to zero the first time an ID is seen and to some positive integer every time after that.
  • awk treats zero as false and any other number as true, so we negate the post-increment expression with the ! operator. Now we have an expression that is true only when an ID is seen for the first time: !seen[$1]++
  • awk programs look like condition1 {body1} condition2 {body2} ...
    • A body is executed only when its corresponding condition evaluates to true.
    • If the condition is present but the body is omitted, the default action is {print}.
    • To be complete: when the body is present but the condition is omitted, the default condition evaluates to true and the action is performed for every record.
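To make the post-increment and negation steps concrete, here is a small trace. It is only an illustrative sketch: the `before` and `trace` variable names are made up for this example, and the three input lines are taken from the sample data in the question.

```shell
# Feed three sample records to awk and show, for each one, the value that
# seen[$1]++ yields (the count before the increment) and its negation.
trace=$(printf '%s\n' 'ID1, VALUE1_1' 'ID1, VALUE1_2' 'ID2, VALUE2_1' |
    awk -F, '{
        before = seen[$1]++          # post-increment returns the old count
        print $1, before, (!before)  # negation is 1 only when the old count was 0
    }')
printf '%s\n' "$trace"
# Prints:
# ID1 0 1
# ID1 1 0
# ID2 0 1
```

The last column is 1 exactly once per ID, which is why using the negated expression as a condition selects only the first record for each ID.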

To sum up, this awk program prints the current record whenever the expression evaluates to true, which happens only the first time an ID is seen.
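As a quick check of that summary, the sketch below runs the one-liner on the sample data from the question alongside two longer spellings that should behave the same way. The temporary file and the `short`, `explicit` and `spelled_out` variable names are assumptions made for this illustration.

```shell
# Recreate the sample data from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
EOF

# Condition only: the default action {print} is applied.
short=$(awk -F, '!seen[$1]++' "$sample")
# Same condition with the default action written out.
explicit=$(awk -F, '!seen[$1]++ {print}' "$sample")
# No condition: the body runs for every record and does the test itself.
spelled_out=$(awk -F, '{ if (!seen[$1]++) print $0 }' "$sample")

printf '%s\n' "$short"
# Prints:
# ID1, VALUE1_1
# ID2, VALUE2_1
# ID3, VALUE3_1

rm -f "$sample"
```

All three variables end up holding the same three lines; the short form simply relies on awk's default action.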

If you really want the longest line for each ID:

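awk '
    # Track the longest second field seen so far for each ID, and the line it came from
    length($2) > max[$1] {max[$1] = length($2); line[$1] = $0}
    # After the last record, print the remembered line for every ID
    END {for (id in line) {print line[id]}}
' file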

This may shuffle the order of the IDs, because associative arrays are unordered collections. You can always pipe the output into sort if that is a problem.
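If sorted output is not what you want and the IDs should come out in the order they first appear in the file, one possible variation is to record that order in a second array. This is a sketch and not part of the original answer: the `order` array, the `n` counter, the `longest_in_order` variable and the four input lines are all invented for this illustration.

```shell
# Made-up input in which the IDs are not in sorted order.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID2, B
ID1, AAA
ID2, BBBB
ID1, A
EOF

longest_in_order=$(awk '
    # First time an ID appears: remember its position in the input.
    !($1 in max) {order[++n] = $1; max[$1] = -1}
    # Keep the record whose second field is the longest so far for this ID.
    length($2) > max[$1] {max[$1] = length($2); line[$1] = $0}
    # Emit one record per ID, in first-appearance order.
    END {for (i = 1; i <= n; i++) print line[order[i]]}
' "$sample")
printf '%s\n' "$longest_in_order"
# Prints:
# ID2, BBBB
# ID1, AAA

rm -f "$sample"
```

Iterating over a numeric index in the END block avoids the unspecified traversal order of `for (id in line)`, at the cost of one extra array.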
