Fastest way to find lines of a file from another larger file in Bash

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion lines. file1.txt has a single field f1 per line while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Output expected:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried and it seems to be taking several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.

Solution

A small piece of Perl code solved the problem. This is the approach taken:

  • store the lines of file1.txt in a hash
  • read file2.txt line by line, parse and extract the second field
  • check if the extracted field is in the hash; if so, print the line

Here is the code:

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);


I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:

real    0m11.694s
user    0m11.507s
sys 0m0.174s
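
For reference, the script reads the two file names from the command line and writes matches to stdout, so the timing above would have been gathered with an invocation along these lines (the output file name perl.out is only illustrative):

time ./fgrep.pl file1.txt file2.txt > perl.out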

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

It was way slower than the Perl solution, since it loops over the 14K hash entries for each line of file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys 0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys 0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

real    895m14.862s
user    806m59.219s
sys 1m12.147s
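
@oliv's code isn't reproduced here, but to give a feel for what "building a huge regex" means: the idea is to join all 14K keys into one big alternation and scan every line of file2.txt against it. A rough shell equivalent is the sketch below; it ignores regex metacharacters in the keys, matches anywhere on the line, and would likely hit pattern-size limits with 14K keys, so it is only meant to illustrate the concept (regex.out is an illustrative name):

grep -E "$(paste -s -d '|' file1.txt)" file2.txt > regex.out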

I tried to follow the suggestion to use parallel (https://stackoverflow.com/a/42309876/6862601). However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
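
The suggestion amounts to splitting file2.txt into chunks and running one fgrep per chunk in parallel. My attempt was roughly of the following shape (reconstructed, so the block size and exact flags may differ from what the linked answer proposes):

parallel --pipepart -a file2.txt --block 100M fgrep -F -f file1.txt > file.matched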


What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.

I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
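
For completeness, a join-based pipeline would have looked roughly like the sketch below (untested; it forces C collation so sort and join agree, and it needs file2.txt sorted on its second field, which is exactly the sorting overhead I wanted to avoid; join.out is an illustrative name):

LC_ALL=C join -t '|' -1 1 -2 2 -o 2.1,2.2,2.3 <(LC_ALL=C sort file1.txt) <(LC_ALL=C sort -t '|' -k2,2 file2.txt) > join.out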

Thanks everyone for your attention and responses.
