Sort and remove duplicates based on column

Question

I have a text file:

$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10

I'd like to sort the file numerically on the first column and remove duplicate lines using sort, but the results are not what I expected.

$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1

The output is not sorted by the first column.

$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1

This removes every 542,9,1,418,1 line, but I'd like to keep one copy.

It seems that the first approach removes duplicates but does not sort correctly, whereas the second sorts correctly but removes more than I want. How can I get the correct result?

Recommended answer

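The problem is that when you give sort a key, -u judges uniqueness on that key alone rather than on the whole line. With -k1n,1n the key is only the first field, so once 542,8,1,418,1 has been kept, sort treats the next two lines starting with 542 as duplicates and filters them out.

In the first command, -k1n has no end field, so the key runs from field 1 to the end of the line. The ordering you got is consistent with a locale in which numeric sort accepts the comma as a thousands separator, so each line is compared as one large number.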
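To see the key-based uniqueness in isolation, here is a small sketch that recreates the sample data in a temporary file and counts how many lines survive when the key is limited to field 1. The temporary file and the variable names are illustrative and not part of the original post, and `LC_ALL=C` is an added assumption to keep the result independent of the locale.

```shell
# Recreate the sample file from the question in a temp file (illustrative).
tmp=$(mktemp)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$tmp"

# Key limited to field 1: -u keeps one line per distinct first field.
by_key=$(LC_ALL=C sort -t, -u -k1,1n "$tmp")
printf '%s\n' "$by_key"

# Compare the number of surviving lines with the number of distinct keys.
# $(( ... )) strips any padding that wc may add.
kept=$(( $(printf '%s\n' "$by_key" | wc -l) ))
distinct_keys=$(( $(cut -d, -f1 "$tmp" | LC_ALL=C sort -u | wc -l) ))
echo "kept=$kept distinct_keys=$distinct_keys"

rm -f "$tmp"
```

The two counts match: one line per first-field value, regardless of what the remaining fields contain.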

Your best bet would be to either sort on all columns, so that every field takes part in the uniqueness check:

sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text

or use awk to filter out duplicate lines first and pipe the result to sort:

awk '!_[$0]++' text | sort -t, -nk1,1
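As a rough check, the sketch below runs both approaches against the sample data and stores the results for comparison. The temporary file and the names `all_cols`, `awk_first`, and `seen` are illustrative choices, not from the original answer; `LC_ALL=C` is again an added assumption so the ordering does not depend on the locale.

```shell
# Recreate the sample file from the question in a temp file (illustrative).
text=$(mktemp)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$text"

# Option 1: every field is a numeric key, so -u only drops lines
# whose five fields all compare equal.
all_cols=$(LC_ALL=C sort -t, -k1,1n -k2,2n -k3,3n -k4,4n -k5,5n -u "$text")

# Option 2: awk prints a line only the first time it appears
# (seen[$0]++ evaluates to 0, i.e. false, on first sight),
# then sort orders the survivors by the first field.
awk_first=$(awk '!seen[$0]++' "$text" | LC_ALL=C sort -t, -k1,1n)

printf '%s\n' "$all_cols"
# 199,7,1,419,10
# 301,34,1,689070,1
# 542,8,1,418,1
# 542,9,1,418,1

rm -f "$text"
```

On this input both options give the same four lines. They can differ on other data: option 1 compares fields numerically, so lines such as 1,2 and 01,2 would count as duplicates, while the awk filter compares whole lines as text.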
