Extract all lines from a text file based on a given list of IDs

Problem description

I have 2 text files. file1 contains a list of IDs:

11002
10995
48981
79600

file2:

10993   item    0
11002   item    6
10995   item    7
79600   item    7
439481  item    5
272557  item    7
224325  item    7
84156   item    6
572546  item    7
693661  item    7
.....

I am trying to select all lines from file2 where the ID (first column) is in file1. Currently, what I am doing is to loop through the first file to create a regex like:

^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b

Then run:

grep '^11002\|^10995\|^48981\|^79600' file2.txt

But when the number of IDs in file1 is too large (~2000), the regular expression becomes quite long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.

Recommended answer

Use a hash table. It can be memory-intensive, but lookups take constant time on average, so the cost no longer grows with the number of IDs the way a long alternation regex does. The procedure below is one efficient and correct approach (not the only one): build a hash table whose keys are the IDs in file1, then look up the first column of each line of file2 in it. If the key is in the hash table, the line is printed to standard output:

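#!/usr/bin/env perl

use strict;
use warnings;

# Load every ID from file1 as a hash key for constant-time lookups.
open my $ids_fh, '<', 'file1' or die "could not open file1: $!\n";
my $keyRef = {};
while (<$ids_fh>) {
    chomp;
    $keyRef->{$_} = 1;
}
close $ids_fh;

# Print each line of file2 whose first tab-separated field is a known ID.
open my $data_fh, '<', 'file2' or die "could not open file2: $!\n";
while (<$data_fh>) {
    chomp;
    my ($testKey, $label, $count) = split /\t/, $_;
    # Skip empty lines, where there is no first field to look up.
    if (defined $testKey && exists $keyRef->{$testKey}) {
        print STDOUT "$_\n";
    }
}
close $data_fh;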

There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you will have to come back to a Perl script and make changes, and they are hard enough to manage as it is. One person's opinion.
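
Since the question also mentions Awk, the same hash-lookup idea can be sketched there as well. This is only an illustration, not part of the original answer: the temporary directory and the sample rows (copied from the question, assumed to be tab-separated) are there just to make the snippet runnable on its own. With real data, only the awk line is needed.

```shell
#!/bin/sh
# Hypothetical sample data taken from the question; real files would already exist.
tmpdir=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$tmpdir/file1"
printf '%s\titem\t%s\n' 10993 0 11002 6 10995 7 79600 7 439481 5 > "$tmpdir/file2"

# NR==FNR is true only while awk reads the first file (file1):
#   store each ID as a key of the associative array "ids", then skip to the next line.
# For file2, the bare pattern '$1 in ids' prints lines whose first field is a stored key.
# Caveat: this idiom assumes file1 is not empty.
result=$(awk 'NR==FNR { ids[$1]; next } $1 in ids' "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

On this sample the script prints the rows for 11002, 10995 and 79600. Awk's default field splitting accepts both tabs and spaces, so the one-liner does not depend on which separator file2 actually uses.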
