How to remove lines from a text file that match words in a list?

Problem description

file1: word_list.txt (over 1,000,000 lines)

file2: list.txt (over 1,000,000 lines)

I have a file containing a list of words. I want to remove every occurrence of every word in this file from a big text file.

Example:

File 1 (word list)

111
222

Example text file

111
222
333
444
555

Output

333
444
555

This code is very slow for large files with over 1 million lines:

sed -e "$(sed 's:.*:s/&//ig:' word_list.txt)" list.txt
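
To see where the time goes, it can help to print the script that the inner sed generates. The sketch below uses a hypothetical two-word list in a temporary file (not the real word_list.txt). Each word becomes its own substitute command, so with a million words every line of list.txt is checked against a million expressions.

```shell
# Hypothetical two-word list standing in for word_list.txt.
words=$(mktemp)
printf '%s\n' 111 222 > "$words"

# The inner sed wraps every line (&) in its own substitute command.
generated=$(sed 's:.*:s/&//ig:' "$words")
printf '%s\n' "$generated"
# prints:
#   s/111//ig
#   s/222//ig

rm -f "$words"
```

Note also that these substitutions blank out the matched text but still print the line, so a fully matched line is left behind as an empty line rather than being removed.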

What is the most appropriate method for this problem?

Recommended answer

Assumptions: each file has one word per line, the words are unique within each file, and the files can be sorted (or are already sorted).

$ comm -13 file1 file2

333
444
555

-1   suppress lines unique to file1
-3   suppress lines that appear in both files 

This gives you the words in file2 that are not in file1 (that is, the set difference file2 \ file1).
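
Since comm expects both inputs to be sorted in the same collation order, unsorted files need a sorting step first. The following is a minimal sketch under that assumption. The function name remove_sorted and the temporary files are illustrative, not part of the original answer, and LC_ALL=C is used so that sort and comm agree on a byte-wise order.

```shell
# remove_sorted WORDS TEXT: print the lines of TEXT that are not in WORDS.
# The output comes out sorted, not in the original order of TEXT.
remove_sorted() {
    words_sorted=$(mktemp)
    text_sorted=$(mktemp)
    # Sort (and de-duplicate) both inputs with the same byte-wise collation.
    LC_ALL=C sort -u "$1" > "$words_sorted"
    LC_ALL=C sort -u "$2" > "$text_sorted"
    # -1 drops lines only in WORDS, -3 drops lines present in both.
    LC_ALL=C comm -13 "$words_sorted" "$text_sorted"
    rm -f "$words_sorted" "$text_sorted"
}

# Demo with the sample data from the question, deliberately unsorted.
demo_dir=$(mktemp -d)
printf '%s\n' 222 111 > "$demo_dir/word_list.txt"
printf '%s\n' 555 111 333 222 444 > "$demo_dir/list.txt"
remove_sorted "$demo_dir/word_list.txt" "$demo_dir/list.txt"
# prints:
#   333
#   444
#   555
rm -rf "$demo_dir"
```

The trade-off is that the original line order of the text file is lost, which matters only if that order is meaningful.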

This should be the fastest approach. Please post the timings if you can test alternative solutions.

Alternatively,

$ awk 'NR==FNR{a[$0]; next} !($0 in a)' file1 file2

This should work as long as you have enough memory to hold the word list, and it does not require sorting.
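
The awk one-liner can be unpacked as follows. This is a sketch with the same logic, wrapped in a function whose name (remove_listed) and array name (seen) are illustrative only. While awk reads the first file, NR equals FNR, so each line is stored as an array key. For the second file, only lines that are not keys are printed.

```shell
# remove_listed WORDS TEXT: print the lines of TEXT that are not in WORDS,
# keeping the original order of TEXT.
remove_listed() {
    awk '
        NR == FNR { seen[$0]; next }   # first file: remember each line as a key
        !($0 in seen)                  # second file: print lines that are not keys
    ' "$1" "$2"
}

# Demo with the sample data from the question, deliberately unsorted.
demo_dir=$(mktemp -d)
printf '%s\n' 111 222 > "$demo_dir/word_list.txt"
printf '%s\n' 555 111 333 222 444 > "$demo_dir/list.txt"
remove_listed "$demo_dir/word_list.txt" "$demo_dir/list.txt"
# prints:
#   555
#   333
#   444
rm -rf "$demo_dir"
```

Unlike the comm approach, the output keeps the order of the text file. One caveat: if the word list is empty, NR == FNR also holds while reading the second file, so this sketch would print nothing.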
