Match the index notation of file1 to the indexes of file2 and pull out the matching rows


Problem description

file1 contains multiple alphabetic sequences:

AETYUIOOILAKSJ
EAYEURIOPOSIDK
RYXURIAJSKDMAO
URITORIEJAHSJD
YWQIAKSJDHFKCM
HAJSUDIDSJSIAJ
AJDHDPFDIXSIBJ
JAQIAUXCNCVUFO

file2 contains the indexes of the sequences that I want to pull out and transfer to another file. For example, 3T means I want the sequence from file1 that has a T at position 3. In reality, both files are very large, with thousands of indexes and sequences.
file2:

3T
10K
14D
1J

Desired output:

AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO

Ideally, the output should follow the order of the indexes in file2. In other words, the first index "3T" matches the sequence "AETYUIOOILAKSJ", so that is the first sequence in the new file.

What I have tried:

grep -f file2 file1
grep -fov file2 file1 # possibly to filter for those non-matching entries

I have also tried the command-line tool sift, but I am still having difficulty. Thanks.

Recommended answer

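$ cat tst.awk
# First file (file2): map each position to the character required there.
# Note: a later entry with the same position overwrites an earlier one.
NR==FNR {
    lgth = length($0)
    pos2char[substr($0,1,lgth-1)] = substr($0,lgth,1)
    next
}
# Second file (file1): print a sequence once if any index matches it.
{
    for (pos in pos2char) {
        if ( substr($0,pos,1) == pos2char[pos] ) {
            print
            next
        }
    }
}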

$ awk -f tst.awk file2 file1
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
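
The script above prints matches in the order they appear in file1, which happens to coincide with the file2 order for the sample data. If the output must strictly follow the order of the indexes in file2, one possible variant is sketched below. It is only a sketch under stated assumptions: the function name pull_in_index_order is hypothetical, file1 is assumed to be non-empty and small enough to hold in memory, each file2 line is assumed to be a number followed by exactly one character, and each sequence is printed at most once (for the first index that matches it).

```shell
# Hypothetical helper: $1 is the sequence file (file1), $2 is the index file (file2).
pull_in_index_order() {
    awk '
    # First file: keep every sequence in memory, in its original order.
    NR==FNR { seq[++n] = $0; next }

    # Second file: skip lines too short to hold a position and a character.
    length($0) < 2 { next }

    {
        pos = substr($0, 1, length($0) - 1) + 0   # numeric position, e.g. 3
        chr = substr($0, length($0), 1)           # required character, e.g. T
        for (i = 1; i <= n; i++) {
            # Print each matching sequence once, in the order of the indexes.
            if (!(i in done) && substr(seq[i], pos, 1) == chr) {
                print seq[i]
                done[i] = 1
            }
        }
    }' "$1" "$2"
}
```

It would be called as pull_in_index_order file1 file2. Unlike the original script, two indexes that share a position (for example 3T and 3A) are both honoured, at the cost of scanning all sequences once per index.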
