List only duplicate lines based on one column from a semi-colon delimited file?


Question

I have a file with a bunch of lines. Each one of these lines has 8 semi-colon delimited columns.

How can I (in Linux) return duplicate lines but only based on column number 2? Should I be using grep or something else?

Answer

See my comments in the awk script

$ cat data.txt 
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

$ cat dup.awk 
BEGIN { FS = ";" }

{
    # Keep count of each value seen in the second column
    count[$2]++;

    # Save the line the first time we encounter a unique field
    if (count[$2] == 1)
        first[$2] = $0;

    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];

    # From the second time onward, always print because the field is
    # duplicated
    if (count[$2] > 1)
        print
}

Sample output:

$ sort -t ';' -k 2 data.txt | awk -f dup.awk

Alex Tremble;atrem;415
Alex Trebe;atrem;416
John Thomas;jd;301
John Tomas;jd;302


Here is my solution #2:

awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt

The beauty of this solution is that it preserves the line order, at the expense of using many tools together (awk, sort, uniq, and fgrep).

The awk command prints out the second field, and its output is then sorted. Next, the uniq -d command picks out the duplicated strings. At this point, the standard output contains a list of duplicated second fields, one per line. We then pipe that list into fgrep. The '-f -' flag tells fgrep to read those search strings from standard input.
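To make the stages concrete, here is a sketch that recreates the sample data.txt from earlier in the answer and runs each stage of the pipeline separately:

```shell
# Recreate the sample file used earlier in this answer
cat > data.txt <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# Stage 1: print only the second field of every line
awk -F';' '{print $2}' data.txt

# Stage 2: sort the fields and keep only the duplicated values
awk -F';' '{print $2}' data.txt | sort | uniq -d

# Stage 3: feed those duplicated keys to fgrep as fixed-string
# patterns; '-f -' reads the pattern list from standard input
awk -F';' '{print $2}' data.txt | sort | uniq -d | grep -F -f - data.txt
```

Note that grep -F matches the keys anywhere in the line, not only in the second column, so a key that happened to appear in another field would also be matched; with the sample data above this does not occur.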

Yes, you can go all out with the command line. I like the second solution better for exercising many tools and for a clearer logic (at least to me). The drawbacks are the number of tools and, possibly, the memory used. Also, the second solution is inefficient because it scans the data file twice: the first time with the awk command and the second time with the fgrep command. This consideration matters only when the input file is large.
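If the double scan matters, the same order-preserving result can be produced in a single pass by buffering the lines in memory. This is only a sketch of an alternative, not part of the original answer; it trades the second scan for memory proportional to the file size:

```shell
# Recreate the sample file used earlier in this answer
cat > data.txt <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# Single pass: remember every line and its key, then print the
# lines whose key occurred more than once, in the original order
awk -F';' '
{
    count[$2]++;      # tally each second-column value
    line[NR] = $0;    # buffer the whole line
    key[NR]  = $2;    # and remember its key
}
END {
    for (i = 1; i <= NR; i++)
        if (count[key[i]] > 1)
            print line[i];
}' data.txt
```

With the sample data this prints the two jd lines and the two atrem lines, interleaved exactly as they appear in the file.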
