Fast way of finding lines in one file that are not in another?

Question

I have two large files (sets of filenames). Roughly 30.000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.

I suspect there is a good way to do this using diff(), but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff():

diff file2 file1 | grep '^>' | sed 's/^>\ //'

Surely, there must be a better way?

Recommended answer

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format="" file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines in your case) are output.
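As a quick check, the sketch below runs the same diff options on the sample data from the question. It sorts into temporary files instead of using <( ), an assumption made only so that it also runs in shells without process substitution. The scratch directory and the variable name missing are illustrative.

```shell
# Recreate the question's sample files in a scratch directory (illustrative paths).
tmp=$(mktemp -d)
printf '%s\n' line3 line1 line2 > "$tmp/file1"   # deliberately unsorted
printf '%s\n' line5 line1 line4 > "$tmp/file2"

# Sort copies first; this plays the role of <(sort file1) <(sort file2).
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# Suppress new and unchanged lines so only lines unique to file1 remain.
# diff exits with status 1 when the files differ, so accept that status.
missing=$(diff --new-line-format="" --unchanged-line-format="" \
               "$tmp/file1.sorted" "$tmp/file2.sorted") || [ $? -eq 1 ]

printf '%s\n' "$missing"
rm -rf "$tmp"
```

Because the inputs are sorted first, the result comes out in sorted order rather than in file1's original order.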

Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
     --new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+" "-" or " ", like diff -u (note that it only outputs differences, it lacks the --- +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things like number each line with %dn.
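As an illustration of %dn, here is a minimal sketch that prefixes each line found only in file1 with its line number in file1. It assumes GNU diff and reuses the question's sample files. The scratch directory and the variable name numbered are illustrative.

```shell
# Sample files from the question, already in sorted order.
tmp=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$tmp/file1"
printf '%s\n' line1 line4 line5 > "$tmp/file2"

# %dn expands to the line number (decimal) and %L to the line with its newline.
# New and unchanged lines are suppressed with empty formats.
# diff exits with status 1 when the files differ, so accept that status.
numbered=$(diff --old-line-format='%dn: %L' --new-line-format='' \
                --unchanged-line-format='' "$tmp/file1" "$tmp/file2") || [ $? -eq 1 ]

printf '%s\n' "$numbered"
rm -rf "$tmp"
```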

The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, it iterates over ll1 and uses the in operator to determine whether each line of file1 is present in file2. (This will give different output from the diff method if there are duplicates.)
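A minimal usage sketch, assuming the script is saved as linesnotin.awk (the name used in the split example further down) and that any POSIX awk is available as awk. The sample files are unsorted on purpose, and the scratch directory is illustrative.

```shell
tmp=$(mktemp -d)

# The script described above, saved under the assumed name linesnotin.awk.
cat > "$tmp/linesnotin.awk" <<'EOF'
BEGIN { FS="" }
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
EOF

# Unsorted inputs: no sort step is needed for this approach.
printf '%s\n' line3 line1 line2 > "$tmp/file1"
printf '%s\n' line4 line1 line5 > "$tmp/file2"

# file1 must come first on the command line, then file2.
result=$(awk -f "$tmp/linesnotin.awk" "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The missing lines are reported in the order they appear in file1, not in sorted order.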

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
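END {
  # also test the content against ss1, so earlier duplicates of a matched line are skipped
  for (ll=1; ll<=nl1; ll++) if ((ll in ll1) && (ll1[ll] in ss1)) print ll1[ll]
}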

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), repeated runs with chunks of file1 and reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of -, meaning stdin, on the gawk command line. This is provided by split from file1 in chunks of 20000 lines per invocation.
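To see the chunking in action on a small scale, the sketch below uses a chunk size of 2 instead of 20000. It assumes GNU split (for --filter), uses plain awk in place of gawk, and uses the first awk script under the name linesnotin.awk. The sample data and scratch directory are illustrative.

```shell
tmp=$(mktemp -d)

# First awk script from the answer, saved as linesnotin.awk.
cat > "$tmp/linesnotin.awk" <<'EOF'
BEGIN { FS="" }
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }
(NR!=FNR) { ss2[$0]++; }
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
EOF

printf '%s\n' a b c d e > "$tmp/file1"
printf '%s\n' b d       > "$tmp/file2"

# split feeds file1 to the filter two lines at a time; each awk run reads
# its chunk from stdin (-) and then reads file2 in full.
chunked=$(cd "$tmp" && split -l 2 --filter='awk -f linesnotin.awk - file2' < file1)

printf '%s\n' "$chunked"
rm -rf "$tmp"
```

Each invocation only holds one chunk of file1 plus file2 in memory, at the cost of rereading file2 once per chunk.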

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools which provides GNU diff, awk, though only a POSIX/BSD split rather than a GNU version.
