Fast way of finding lines in one file that are not in another?

Question

I have two large files (sets of filenames). Roughly 30.000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.

I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff:

 diff file2 file1 | grep '^>' | sed 's/^> //'

Surely, there must be a better way?

Recommended answer

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc) for less strict matching.
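As a quick check, the sketch below rebuilds the two sample files from the question in a scratch directory and runs the same diff command on sorted copies. The scratch directory and the variable name `result` are illustrative assumptions. The sorted copies are written to files instead of using `<( )`, so the sketch also runs in shells without process substitution. It assumes GNU diff.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# Write sorted copies (the answer does this with <(sort ...) in bash/zsh).
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# diff exits with status 1 when the inputs differ; that is not an error here.
result=$(diff --new-line-format="" --unchanged-line-format="" \
    "$tmp/file1.sorted" "$tmp/file2.sorted") || [ "$?" -eq 1 ]

# Lines present only in file1.
printf '%s\n' "$result"

rm -rf "$tmp"
```

For the sample data this prints the two lines the question asks for, line2 and line3.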

Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
     --new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs the prefixed lines; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things, like numbering each line with %dn.
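To illustrate %dn, here is a minimal sketch that prefixes each line found only in the first file with its line number in that file. The file contents and the variable name `numbered` are made up for the example, and GNU diff is assumed.

```shell
#!/bin/sh
# Two small, already sorted sample files (contents are illustrative).
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmp/file1"
printf 'alpha\ngamma\n' > "$tmp/file2"

# %dn is the input line number in decimal, %L is the line with its newline.
# New and unchanged lines are suppressed, so only lines unique to file1 remain.
numbered=$(diff --old-line-format='%dn: %L' \
    --new-line-format='' --unchanged-line-format='' \
    "$tmp/file1" "$tmp/file2") || [ "$?" -eq 1 ]

printf '%s\n' "$numbered"

rm -rf "$tmp"
```

Here "beta" is the only line missing from the second file, so it is reported together with its position (line 2) in the first file.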

The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output to the diff method if there are duplicates.)
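A usage sketch for the script above, assuming it is saved as linesnotin.awk (the name used in the split command later in this answer). The sample data, scratch directory and the variable name `missing` are illustrative. The input is deliberately unsorted to show that the file1 order is kept.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Save the awk script from the answer; the quoted 'EOF' stops the shell
# from expanding $0 inside the here-document.
cat > "$tmp/linesnotin.awk" <<'EOF'
# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
EOF

# Unsorted sample input (contents are illustrative).
printf 'pear\napple\nfig\n' > "$tmp/file1"
printf 'fig\nkiwi\n' > "$tmp/file2"

# file1 must be named first: the script tells the files apart via NR==FNR.
missing=$(awk -f "$tmp/linesnotin.awk" "$tmp/file1" "$tmp/file2")
printf '%s\n' "$missing"

rm -rf "$tmp"
```

The output lists pear and then apple, in the order they appear in file1 rather than in sorted order.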

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number, ll1[], and one indexed by line content, ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.
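The same kind of check can be run against this memory-saving variant. In the sketch below the script file name linesnotin2.awk, the sample data and the variable name `remaining` are hypothetical choices made only for the example.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Save the memory-saving script; linesnotin2.awk is a made-up name.
cat > "$tmp/linesnotin2.awk" <<'EOF'
BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
EOF

# Unsorted, duplicate-free sample input (contents are illustrative).
printf 'pear\napple\nfig\nplum\n' > "$tmp/file1"
printf 'plum\napple\n' > "$tmp/file2"

remaining=$(awk -f "$tmp/linesnotin2.awk" "$tmp/file1" "$tmp/file2")
printf '%s\n' "$remaining"

rm -rf "$tmp"
```

One caveat when reading the script as written: ss1[] keeps only the last line number for a given line content, so if a line occurs more than once in file1, its earlier copies are still printed even when file2 contains that line. Removing duplicates from file1 beforehand avoids this.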

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), making repeated runs with chunks of file1 and reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of -, meaning stdin, on the gawk command line. This is provided by split from file1 in chunks of 20000 lines per invocation.
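A small-scale sketch of the chunked run, using 2 lines per chunk instead of 20000 so the effect is visible on tiny files. It assumes GNU split (for --filter) and uses plain awk in place of gawk, on the assumption that the script needs no gawk-specific features. The sample data and the variable name `chunked` are illustrative.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# The first awk script from the answer, saved under the name used above.
cat > "$tmp/linesnotin.awk" <<'EOF'
BEGIN { FS="" }
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # chunk of file1 read from stdin
(NR!=FNR) { ss2[$0]++; }                # file2, read in full for every chunk
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
EOF

printf 'l1\nl2\nl3\nl4\nl5\n' > "$tmp/file1"
printf 'l2\nl4\n' > "$tmp/file2"

# split feeds file1 to the filter two lines at a time; "-" makes awk read
# that chunk from stdin before it reads file2.
chunked=$(cd "$tmp" && \
    split -l 2 --filter='awk -f linesnotin.awk - file2' < file1)
printf '%s\n' "$chunked"

rm -rf "$tmp"
```

Each invocation only holds one chunk of file1 plus all of file2 in memory, at the cost of rereading file2 once per chunk.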

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provide GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.
