Extract all lines from a text file based on a given list of IDs
Question
I have two text files. file1 contains a list of IDs:
11002
10995
48981
79600
file2:
10993 item 0
11002 item 6
10995 item 7
79600 item 7
439481 item 5
272557 item 7
224325 item 7
84156 item 6
572546 item 7
693661 item 7
.....
I am trying to select all lines from file2 where the ID (first column) is in file1. Currently, I loop through the first file to build a regex like:
^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b
and then run:
grep '^11002\|^10995\|^48981\|^79600' file2.txt
But when the number of IDs in file1 is large (~2000), the regular expression becomes quite long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.
Recommended answer
Use a hash table. It can be memory-intensive, but lookups take constant time, so the cost no longer grows with the number of IDs the way a long alternation regex does. This is an efficient and correct procedure (not the only one, but efficient and correct): build a hash table whose keys are the IDs in file1, then look up the first column of each line of file2 in that table. If the key is in the hash table, the line is printed to standard output:
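#!/usr/bin/env perl
use strict;
use warnings;

# Load every ID from file1 as a hash key.
open my $ids_fh, '<', 'file1' or die "could not open file1: $!\n";
my %wanted;
while (my $id = <$ids_fh>) {
    chomp $id;
    $wanted{$id} = 1;
}
close $ids_fh;

# Print each line of file2 whose first column is one of those IDs.
open my $data_fh, '<', 'file2' or die "could not open file2: $!\n";
while (my $line = <$data_fh>) {
    chomp $line;
    # split ' ' splits on any run of whitespace, so tabs and spaces both work.
    my ($test_key) = split ' ', $line;
    print "$line\n" if defined $test_key && exists $wanted{$test_key};
}
close $data_fh;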
There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you will have to come back to a Perl script and make changes, and they are hard enough to manage as it is. One person's opinion.
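Since the question mentions Awk, the same hash lookup can also be sketched as an awk one-liner. This is only a minimal sketch and not part of the original answer. It assumes whitespace-separated columns and a non-empty file1. The temporary directory and the sample rows are hypothetical stand-ins for the real files.

```shell
# Hypothetical sample data mirroring the first rows of the question's files.
workdir=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$workdir/file1"
printf '%s\n' '10993 item 0' '11002 item 6' '10995 item 7' \
              '79600 item 7' '439481 item 5' > "$workdir/file2"

# While reading the first file (NR == FNR), store each ID as an array key.
# For the second file, print the line if its first field is a stored key.
matches=$(awk 'NR == FNR { ids[$1]; next } $1 in ids' \
              "$workdir/file1" "$workdir/file2")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

With this sample data, the rows for 11002, 10995 and 79600 are printed. As in the Perl script, each line of file2 costs one array lookup, however many IDs file1 holds.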