How to remove lines from a text file that match words in a list?
Question
file1 > word_list.txt > over 1,000,000 lines
file2 > list.txt > over 1,000,000 lines
I have a file containing a list of words. I want to remove every occurrence of every word in this file from a big text file.
Example:
file1 (word list)
111
222
file2 (example text file)
111
222
333
444
555
Output
333
444
555
This code is very slow for large files with over 1 million lines:
sed -e "$(sed 's:.*:s/&//ig:' word_list.txt)" list.txt
What is the most appropriate method for this problem?
Recommended answer
Assumptions: each file has one word per line, the words within each file are unique, and the files can be sorted (or are already in sorted order).
$ comm -13 file1 file2
333
444
555
-1 suppress lines unique to file1
-3 suppress lines that appear in both files
This gives you the words in file2 that are not in file1, that is, the set difference file2 \ file1.
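Because comm relies on both inputs being sorted in the same collation order, unsorted files need a sorting step first. The following is a minimal sketch under that assumption; the sample data comes from the question, while the intermediate file names (word_list.sorted, list.sorted, result.txt) are made up for illustration.

```shell
# Sample data from the question, deliberately out of order.
printf '%s\n' 222 111 > word_list.txt
printf '%s\n' 555 111 333 222 444 > list.txt

# comm expects both inputs sorted in the same collation order,
# so sort (and de-duplicate) with a fixed locale first.
LC_ALL=C sort -u word_list.txt > word_list.sorted
LC_ALL=C sort -u list.txt > list.sorted

# -1 drops lines only in the first file, -3 drops lines in both,
# leaving the lines that appear only in the second file.
LC_ALL=C comm -13 word_list.sorted list.sorted > result.txt
cat result.txt
```

Note that the output comes back in sorted order, not in the original order of list.txt.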
This should be the fastest approach. If you can test alternative solutions, please post the timings.
Alternatively,
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' file1 file2
This should work as long as you have enough memory to hold the words from file1, and it does not require sorting.
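To make the awk one-liner easier to follow, here is the same logic spelled out with comments and run on the sample data from the question. This is only an annotated sketch: the array name seen and the output file filtered.txt are illustrative choices, not part of the original answer.

```shell
# Sample data from the question.
printf '%s\n' 111 222 > word_list.txt
printf '%s\n' 111 222 333 444 555 > list.txt

awk '
    # While reading the first file, the global line counter NR
    # equals the per-file counter FNR: store each line as a key.
    NR == FNR { seen[$0]; next }
    # For the second file, print only lines not stored above.
    !($0 in seen)
' word_list.txt list.txt > filtered.txt
cat filtered.txt
```

Unlike the comm approach, this keeps the lines of list.txt in their original order. One caveat: if the word list is empty, NR == FNR also holds for the second file, so nothing is printed.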