How to find duplicate lines across 2 different files? Unix

Question
From the unix terminal, we can use diff file1 file2
to find the difference between two files. Is there a similar command to show the similarities across 2 files? (Many pipes allowed if necessary.)
Each file contains one string sentence per line; the files are sorted and duplicate lines removed with sort file1 | uniq.
file1: http://pastebin.com/taRcegVn
file2: http://pastebin.com/2fXeMrHQ
And the output should contain the lines that appear in both files.
output: http://pastebin.com/FnjXFshs
I am able to use python to do it as such, but I think it's a little too much to put into the terminal:
x = set([i.strip() for i in open('wn-rb.dic')])
y = set([i.strip() for i in open('wn-s.dic')])
z = x.intersection(y)
outfile = open('reverse-diff.out', 'w')  # 'w' mode needed to write; default mode is read-only
for i in z:
    print>>outfile, i
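The same set intersection fits in one shell command with comm, which is POSIX and requires both inputs to be sorted, as the question already guarantees. A sketch with hypothetical sample files:

```shell
# Hypothetical sample inputs; in the question these would be the two .dic files.
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n' > file2

# comm compares two sorted files in three columns; -1 suppresses lines unique
# to file1 and -2 suppresses lines unique to file2, leaving only common lines.
comm -12 file1 file2
# prints: banana
#         cherry
```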
Answer
As @tjameson mentioned, it may be solved in another thread.
Just would like to post another solution:
sort file1 file2 | awk 'dup[$0]++ == 1'
Refer to an awk guide for the basics: when the pattern of a line evaluates to true, that line is printed.
dup[$0] is a hash table in which each key is a line of the input. The value starts at 0 and is incremented each time the line occurs; when the line occurs again, the stored value is 1, so dup[$0]++ == 1 is true and the line is printed.
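For instance, running the pipeline on two small hypothetical files:

```shell
# Two sorted files with no internal duplicates, as in the question.
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n' > file2

# After sorting the merged input, any line present in both files appears
# twice in a row; dup[$0]++ is 0 on the first occurrence and 1 on the
# second, so only the repeated (shared) lines are printed.
sort file1 file2 | awk 'dup[$0]++ == 1'
# prints: banana
#         cherry
```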
Note that this only works when there are no duplicates within either file, as was specified in the question.
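If that guarantee might not hold, one workaround (a sketch using hypothetical temp file names) is to deduplicate each file before merging, so a line repeated inside one file cannot be mistaken for a cross-file match:

```shell
# file1 repeats "apple" internally; merging naively would wrongly report it.
printf 'apple\napple\nbanana\n' > file1
printf 'cherry\n' > file2

# sort -u sorts and removes duplicates within each file individually.
sort -u file1 > file1.uniq
sort -u file2 > file2.uniq
sort file1.uniq file2.uniq | awk 'dup[$0]++ == 1'
# prints nothing: no line actually appears in both files
```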