Fastest way to find lines of a file from another larger file in Bash
Question
I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line while file2.txt has 3 fields, f1 through f3, delimited by |.
I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Expected output:
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried and it seems to be taking several hours to run:
fgrep -F -f file1.txt file2.txt > file.matched
I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.
Answer
A small piece of Perl code solved the problem. This is the approach taken:
- store the lines of file1.txt in a hash
- read file2.txt line by line, parse and extract the second field
- check if the extracted field is in the hash; if so, print the line
Here is the code:
#!/usr/bin/perl -w
use strict;
if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}
my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);
open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;
# store contents of small file in a hash
while (<$small_fp>) {
    chomp;
    $small_hash{$_} = undef;
}
close($small_fp);
# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    # the | delimiter must be escaped, since it is a regex metacharacter
    $field = (split(/\|/, $_))[1];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
    }
}
close($big_fp);
exit(0);
I ran the above script with the 14K lines of file1.txt and 1.3M lines of file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output:
real 0m11.694s
user 0m11.507s
sys 0m0.174s
I ran @Inian's awk code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it loops 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1
real 55m5.539s
user 54m53.080s
sys 0m5.095s
Using @Inian's other awk solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out
real 0m39.966s
user 0m37.916s
sys 0m0.743s
time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out
real 0m41.057s
user 0m38.475s
sys 0m0.904s
awk is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:
real 895m14.862s
user 806m59.219s
sys 1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, when it had produced about 100K matches. I wish fgrep had an option to force the content of the -f file to be kept in a hash, just like what the Perl code did.
I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
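For what it's worth, the unchecked join approach would look roughly like this. This is a sketch, not something benchmarked above; it assumes bash process substitution and that the | delimiter never appears inside a field, and both inputs have to be sorted on the join field first, which is exactly the overhead in question:

```shell
# Hypothetical join-based version (not benchmarked above).
# Join file1.txt's single field against field 2 of file2.txt;
# -o 2.1,2.2,2.3 prints file2.txt's three fields for each match.
join -t '|' -1 1 -2 2 -o 2.1,2.2,2.3 \
    <(sort file1.txt) \
    <(sort -t '|' -k 2,2 file2.txt)
```

Note that the matches come back ordered by the join field rather than in file2.txt's original order.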
Thanks everyone for your attention and responses.