Fastest way to find lines of a file from another larger file in Bash

Question

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion lines. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Expected output:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried and it seems to be taking several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.

Answer

A small piece of Perl code solved the problem. This is the approach taken:

  • store the lines of file1.txt in a hash
  • read file2.txt line by line, parse and extract the second field
  • check if the extracted field is in the hash; if so, print the line

Here is the code:

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);
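
For reference, the script takes the small file first and the big file second, matching the usage message above, and writes matching lines to stdout; the output file name in this example is arbitrary:

perl fgrep.pl file1.txt file2.txt > perl.matched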

I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output:

real    0m11.694s
user    0m11.507s
sys 0m0.174s

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

It was way slower than the Perl solution, since it is looping 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys 0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys 0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

real    895m14.862s
user    806m59.219s
sys 1m12.147s

I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
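
For context, the kind of parallel invocation that was suggested presumably looked roughly like the sketch below; this is an assumption, not the exact command that was run. --pipepart and --block are standard GNU parallel options, and each fgrep worker still has to load the full pattern list from file1.txt:

# hedged sketch (assumed invocation): split file2.txt into chunks and run an
# independent fgrep over each chunk in parallel
parallel --pipepart -a file2.txt --block 10M fgrep -f file1.txt > file.matched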

What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.

I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
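
For completeness, a join-based attempt would presumably look something like the sketch below (an untested assumption about the setup); sorting both inputs on the join key is exactly the overhead mentioned above, and -o is needed to restore file2.txt's field order in the output:

# hedged, untested sketch: join requires both inputs sorted on the join key
join -t'|' -1 1 -2 2 -o 2.1,2.2,2.3 <(sort file1.txt) <(sort -t'|' -k2,2 file2.txt) > join.out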

Thanks everyone for your attention and responses.
