Fastest way to find lines of a file from another larger file in Bash
I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion lines. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.
I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Output expected:
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried and it seems to be taking several hours to run:
fgrep -F -f file1.txt file2.txt > file.matched
I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.
A small piece of Perl code solved the problem. This is the approach taken:
- store the lines of file1.txt in a hash
- read file2.txt line by line, parse and extract the second field
- check if the extracted field is in the hash; if so, print the line
Here is the code:
#!/usr/bin/perl -w
use strict;

if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
    chomp;
    $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    $field = (split(/\|/, $_))[1];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
    }
}
close($big_fp);

exit(0);
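For reference, a typical invocation of the script above, saved as fgrep.pl, might look like this (the output file name perl.out is only illustrative):

# time the hash-lookup filter and redirect the matching lines to a file
time perl fgrep.pl file1.txt file2.txt > perl.out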
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output:
real 0m11.694s
user 0m11.507s
sys 0m0.174s
I ran @Inian's awk code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it loops over the 14K patterns for each line of file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1
real 55m5.539s
user 54m53.080s
sys 0m5.095s
Using @Inian's other awk solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out
real 0m39.966s
user 0m37.916s
sys 0m0.743s
time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out
real 0m41.057s
user 0m38.475s
sys 0m0.904s
awk is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:
real 895m14.862s
user 806m59.219s
sys 1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
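For the record, the kind of GNU parallel invocation attempted looks roughly like the sketch below; the 10M block size and the output file name are assumptions rather than the exact values used:

# split file2.txt into chunks, run fgrep on each chunk, keep output order
parallel --pipepart -a file2.txt --block 10M -k fgrep -F -f file1.txt > parallel.out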
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it had produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.
I didn't check the join approach because I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
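For completeness, a join-based attempt would look something like the sketch below (untested here; the field numbers follow the formats shown above, and -o restores the original column order):

# both inputs must be sorted on their join fields; LC_ALL=C keeps sort and join collation consistent
LC_ALL=C sort file1.txt > file1.sorted
LC_ALL=C sort -t'|' -k2,2 file2.txt > file2.sorted
LC_ALL=C join -t'|' -1 1 -2 2 -o 2.1,2.2,2.3 file1.sorted file2.sorted > join.out

Sorting the 2-billion-line file up front is exactly the overhead mentioned above.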
Thanks everyone for your attention and responses.