Sort and remove duplicates based on column
Problem description
I have a text file:
$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.
$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1
It is not sorting based on the first column.
$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
It removes the 542,9,1,418,1 line, but I'd like to keep one copy.
It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How can I get the correct result?
Answer
The problem is that when you provide a key to sort, uniqueness is checked only on that particular field. Since the line 542,8,1,418,1 is output, sort treats the next two lines starting with 542 as duplicates and filters them out.
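This behaviour is easy to reproduce. A minimal sketch (the printf simply recreates the sample file named text from the question; wc -l counts the surviving lines):

```shell
# Recreate the sample file from the question
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > text

# With a key limited to field 1, -u compares only that field,
# so all three lines starting with 542 collapse into one:
sort -t, -u -k1n,1 text | wc -l    # prints 3

# Without a key, -u compares whole lines, so only the exact
# duplicate line is dropped:
sort -t, -u text | wc -l           # prints 4
```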
Your best bet would be to either sort on all columns:
sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
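With the sample data from the question, this compares every field numerically in turn, so the two distinct 542 lines both survive while the exact duplicate is dropped (a sketch; the printf recreates the file named text):

```shell
# Recreate the sample file from the question
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > text

# Each -k pair names one field as a key; with keys given, -u
# compares the concatenation of all keys, i.e. the whole line:
sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
# 199,7,1,419,10
# 301,34,1,689070,1
# 542,8,1,418,1
# 542,9,1,418,1
```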
or use awk to filter out duplicate lines and pipe the result to sort:
awk '!_[$0]++' text | sort -t, -nk1,1
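The awk pattern !_[$0]++ keeps only the first occurrence of each whole line: the array _ counts how often each line has been seen, and the pattern is true only while that count is still zero. A sketch against the same sample file:

```shell
# Recreate the sample file from the question
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > text

# awk drops the repeated 542,9,1,418,1 line; sort then orders
# the remaining four lines numerically by the first field:
awk '!_[$0]++' text | sort -t, -nk1,1
# 199,7,1,419,10
# 301,34,1,689070,1
# 542,8,1,418,1
# 542,9,1,418,1
```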