Fast alternative to grep -f
Problem description
file.contain.query.txt
ENST001
ENST002
ENST003
file.to.search.in.txt
ENST001 90
ENST002 80
ENST004 50
Because ENST003 has no entry in the second file and ENST004 has no entry in the first file, the expected output is:
ENST001 90
ENST002 80
To search a file for multiple queries, we usually do the following:
grep -f file.contain.query.txt < file.to.search.in.txt > output.file
Since I have about 10,000 queries and almost 100,000 rows in file.to.search.in.txt, it takes a very long time to finish (about 5 hours). Is there a fast alternative to grep -f?
Recommended answer
If you want a pure Perl option, read the keys from your query file into a hash table, then check each line of standard input against those keys:
#!/usr/bin/env perl
use strict;
use warnings;

# build hash table of keys
my %keyring;
open(my $keys, '<', 'file.contain.query.txt')
    or die "Cannot open file.contain.query.txt: $!";
while (my $line = <$keys>) {
    chomp $line;
    $keyring{$line} = 1;
}
close $keys;

# look up the key from each line of standard input
while (my $line = <STDIN>) {
    chomp $line;
    # split on whitespace (tabs or spaces); replace the delimiter as needed
    my ($key) = split ' ', $line, 2;
    next unless defined $key;    # skip blank lines
    print "$line\n" if exists $keyring{$key};
}
You would use it like this:
lookup.pl < file.to.search.in.txt
A hash table can take a fair amount of memory, but searches are much faster (hash table lookups take constant time on average), which is handy since you have about ten times more lines to look up than keys to store.
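The same hash-lookup idea can also be sketched with awk, whose arrays are associative. This is only an illustration and not part of the original answer: it assumes the ID is the first whitespace-separated field, that the query file is not empty, and it recreates the sample files from the question in a scratch directory (the `workdir` variable is a name chosen for this example).

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory
# (assumption: illustration only, so real data is left untouched).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'ENST001\nENST002\nENST003\n' > file.contain.query.txt
printf 'ENST001 90\nENST002 80\nENST004 50\n' > file.to.search.in.txt

# While reading the first file (NR == FNR), store each query ID as an array key.
# For the second file, print only lines whose first field is a stored key.
awk 'NR == FNR { keys[$1]; next } $1 in keys' \
    file.contain.query.txt file.to.search.in.txt > output.file

cat output.file
```

On the sample data this prints the two expected lines, ENST001 90 and ENST002 80. Each line of the searched file costs one array lookup, so the run time grows with the total number of lines rather than with queries multiplied by lines.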