List only duplicate lines based on one column from a semi-colon delimited file?
Question
I have a file with a bunch of lines. Each one of these lines has 8 semi-colon delimited columns.
How can I (in Linux) return duplicate lines, but only based on column number 2? Should I be using grep or something else?
Answer
See my comments in the awk script
$ cat data.txt
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
$ cat dup.awk
BEGIN { FS = ";" }
{
# Count occurrences of each second-column field
count[$2]++;
# Save the line the first time we encounter a unique field
if (count[$2] == 1)
first[$2] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$2] == 2)
print first[$2];
# From the second time onward, always print, because the field is
# duplicated
if (count[$2] > 1)
print
}
Sample output:
$ sort -t ';' -k 2 data.txt | awk -f dup.awk
Alex Tremble;atrem;415
Alex Trebe;atrem;416
John Thomas;jd;301
John Tomas;jd;302
Here is my solution #2:
awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt
The beauty of this solution is that it preserves the line order, at the expense of chaining several tools together (awk, sort, uniq, and fgrep).
The awk command prints out the second field, and its output is then sorted. Next, uniq -d picks out the duplicated strings. At this point, standard output contains the list of duplicated second fields, one per line. We then pipe that list into fgrep (grep -F). The -f - flag tells fgrep to read the strings to search for from standard input.
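One caveat worth noting (my addition, not part of the original answer): grep -F matches the duplicated strings anywhere in the line, so a short key such as jd could in principle also match inside another column. A sketch of a stricter variant that replaces the fgrep step with an exact comparison on the second field:

```shell
# List the duplicated second fields as before, but filter with an
# exact field-2 comparison instead of a substring match.
awk -F';' '{print $2}' data.txt | sort | uniq -d |
  awk -F';' 'NR==FNR { dup[$0]; next } $2 in dup' - data.txt
```

Here NR==FNR is true only while awk reads the key list from standard input (the - argument); the subsequent pass over data.txt prints every line whose second field is in that set, in the file's original order.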
Yes, you can go all out on the command line. I like the second solution better because it exercises many tools and its logic is clearer (at least to me). The drawbacks are the number of tools involved and, possibly, the memory used. The second solution is also inefficient because it scans the data file twice: once with the awk command and once with the fgrep command. This consideration matters only when the input file is large.
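As a middle-ground sketch (my addition, not from the original answer): the same two-pass idea can be written in awk alone. It still reads the data file twice, but it keeps the original line order with a single tool and no intermediate pipeline:

```shell
# Pass 1 (NR==FNR) counts each second field; pass 2 prints the lines
# whose second field occurred more than once, in original file order.
awk -F';' 'NR==FNR { count[$2]++; next } count[$2] > 1' data.txt data.txt
```

Naming the file twice makes awk process it twice; NR==FNR distinguishes the counting pass from the printing pass.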