shell: select unique row of flat file
Question
I have a flat file that looks like this:
cat file
ID1, VALUE1_1
ID1, VALUE1_2
ID1, VALUE1_3
ID2, VALUE2_1
ID2, VALUE2_1
ID3, VALUE3_1
ID3...
As you can see from the data sample, each ID has several values, and they could be anything, identical or not. I don't care which value gets picked; any value works for me.
So I only want one value for each ID. I don't really care which one, but if I have to choose, I would say the row with the longest length.
ID1, VALUE1_2
ID2, VALUE2_1
ID3, VALUE3_1
It might be done in Python, but is there an easy way to do that in the shell itself? I am open to using sed or awk, but please don't write a whole paragraph of awk code.
It might look like:
# Pseudo code
# sort -k 1 file | uniq (max(length) by id)
Thanks a lot!
Recommended answer
This will find the first line for each ID:
awk -F, '!seen[$1]++' file
Explanation:
- awk associative arrays do not have to be pre-declared, so the first time an ID is encountered, seen[$1] will have the value zero (in a numeric context).
- seen[$1]++ post-increments the associative array element, so the expression evaluates to zero the first time an ID is seen and to some positive integer any other time.
- awk treats zero as false and any other number as true, so we negate the post-increment expression with the ! operator. Now we have an expression that is true only when an ID is seen for the first time: !seen[$1]++
- awk programs look like condition1 {body1} condition2 {body2} ...
- The body will be executed only when its corresponding condition evaluates to true.
- If the condition is present but the body is omitted, the default action is {print}.
- To be complete: when the body is present but the condition is omitted, the default condition evaluates to true and the action is performed for every record.
To sum up, this awk program will print the current record whenever the expression evaluates to true, which will only be the first time an ID is seen.
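To make the counter visible, here is a small sketch that prints the value of seen[$1]++ next to each record before running the one-liner itself. It assumes the first six rows from the question are held in a shell variable named sample (a hypothetical name) and piped to awk on standard input instead of being read from a file.

```shell
# "sample" is a hypothetical variable holding the question's first six rows
sample=$(printf '%s\n' 'ID1, VALUE1_1' 'ID1, VALUE1_2' 'ID1, VALUE1_3' \
                       'ID2, VALUE2_1' 'ID2, VALUE2_1' 'ID3, VALUE3_1')

# Print the post-increment value for each record: 0 means "first time seen"
printf '%s\n' "$sample" | awk -F, '{ n = seen[$1]++; print n ": " $0 }'

# The one-liner keeps only the records for which that value was 0
first=$(printf '%s\n' "$sample" | awk -F, '!seen[$1]++')
printf '%s\n' "$first"
```

The first command should show 0 only on the first row of each ID, and those are exactly the rows the second command prints.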
If you really want the longest line for each ID:
awk ' length($2) > max[$1] {max[$1] = length($2); line[$1] = $0} END {for (id in line) {print line[id]}} ' file
This may shuffle the order of the IDs (associative arrays are unordered collections). You can always pipe that into sort if it's a problem.
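As a rough sketch of that sort pipeline, the snippet below runs the longest-line program on a few made-up rows and sorts the result. The rows and the variable names rows and longest are assumptions for illustration, not data from the question; the awk program is the one shown above, using the default whitespace field splitting.

```shell
# Hypothetical sample rows (not from the question) with values of differing lengths
rows=$(printf '%s\n' 'ID2, bb' 'ID1, a' 'ID1, aaa' 'ID2, b')

# Keep the row with the longest second field per ID, then sort for a stable order
longest=$(printf '%s\n' "$rows" |
  awk 'length($2) > max[$1] { max[$1] = length($2); line[$1] = $0 }
       END { for (id in line) print line[id] }' |
  sort)
printf '%s\n' "$longest"
```

Because the comparison is a strict >, the first row wins when two values for the same ID have equal length.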