Remove duplicates from text file based on second text file
How can I remove from a text file (main.txt) all lines that also appear in a second text file (removethese.txt)? What is an efficient approach if the files are larger than 10-100 MB? [Using Mac]
Example:
main.txt
3
1
2
5
Remove these lines
removethese.txt
3
2
9
Output:
output.txt
1
5
Example Lines (these are the actual lines I'm working with - order does not matter):
ChIJW3p7Xz8YyIkRBD_TjKGJRS0
ChIJ08x-0kMayIkR5CcrF-xT6ZA
ChIJIxbjOykFyIkRzugZZ6tio1U
ChIJiaF4aOoEyIkR2c9WYapWDxM
ChIJ39HoPKDix4kRcfdIrxIVrqs
ChIJk5nEV8cHyIkRIhmxieR5ak8
ChIJs9INbrcfyIkRf0zLkA1NJEg
ChIJRycysg0cyIkRArqaCTwZ-E8
ChIJC8haxlUDyIkRfSfJOqwe698
ChIJxRVp80zpcEARAVmzvlCwA24
ChIJw8_LAaEEyIkR68nb8cpalSU
ChIJs35yqObit4kR05F4CXSHd_8
ChIJoRmgSdwGyIkRvLbhOE7xAHQ
ChIJaTtWBAWyVogRcpPDYK42-Nc
ChIJTUjGAqunVogR90Kc8hriW8c
ChIJN7P2NF8eVIgRwXdZeCjL5EQ
ChIJizGc0lsbVIgRDlIs85M5dBs
ChIJc8h6ZqccVIgR7u5aefJxjjc
ChIJ6YMOvOeYVogRjjCMCL6oQco
ChIJ54HcCsaeVogRIy9___RGZ6o
ChIJif92qn2YVogR87n0-9R5tLA
ChIJ0T5e1YaYVogRifrl7S_oeM8
ChIJwWGce4eYVogRcrfC5pvzNd4
There are two standard ways to do this:
With grep:
grep -vxFf removethese main
This uses:
-v to invert the match.
-x to match whole lines only, which prevents, for example, he from matching lines like hello or highway to hell.
-F to use fixed strings, so that each pattern is taken as it is, not interpreted as a regular expression.
-f to get the patterns from another file, in this case from removethese.
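As a quick check of the grep approach, here is a minimal sketch that recreates the sample files from the question and writes the result to output.txt. The scratch directory and the workdir variable are assumptions made for illustration only; the file names main.txt, removethese.txt and output.txt come from the question.

```shell
# Scratch directory (hypothetical; any writable location works).
workdir=$(mktemp -d "${TMPDIR:-/tmp}/grepdemo.XXXXXX")

# Sample data from the question.
printf '%s\n' 3 1 2 5 > "$workdir/main.txt"
printf '%s\n' 3 2 9 > "$workdir/removethese.txt"

# -v: invert the match, -x: whole-line match,
# -F: fixed strings, -f: read the patterns from a file.
grep -vxFf "$workdir/removethese.txt" "$workdir/main.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

With the sample data this prints 1 and 5, each on its own line, in the same order as they appear in main.txt.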
With awk:
$ awk 'FNR==NR {a[$0];next} !($0 in a)' removethese main
1
5
The condition FNR==NR is true only while awk reads the first file, so every line of removethese is stored as a key in the array a[]. Then we read the main file and print only those lines that are not present in the array.
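The awk approach can be tried the same way. The sketch below is illustrative only: the scratch directory, the array name seen and the file empty.txt are assumptions, not part of the original answer. It also shows one edge case worth knowing: if the first file is empty, FNR==NR stays true while awk reads the second file, so nothing is printed. Comparing FILENAME with ARGV[1] is one possible guard against that.

```shell
# Scratch directory (hypothetical; any writable location works).
workdir=$(mktemp -d "${TMPDIR:-/tmp}/awkdemo.XXXXXX")

# Sample data from the question, plus an empty removal list.
printf '%s\n' 3 1 2 5 > "$workdir/main.txt"
printf '%s\n' 3 2 9 > "$workdir/removethese.txt"
: > "$workdir/empty.txt"

# First file: remember each line as an array key.
# Second file: print only the lines that were not remembered.
awk 'FNR==NR { seen[$0]; next } !($0 in seen)' \
    "$workdir/removethese.txt" "$workdir/main.txt" > "$workdir/output.txt"
cat "$workdir/output.txt"

# Guarded variant: the first block runs only for the first file argument,
# so an empty removal list leaves main.txt unchanged.
awk 'FNR==NR && FILENAME==ARGV[1] { seen[$0]; next } !($0 in seen)' \
    "$workdir/empty.txt" "$workdir/main.txt" > "$workdir/unchanged.txt"
```

With the sample data, output.txt contains 1 and 5, and unchanged.txt contains all four lines of main.txt.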