Extract rows and substrings from one file conditional on information in another file

Question

I have a file 1.blast with coordinate information like this:

1       gnl|BL_ORD_ID|0 100.00  33      0       0       1        3
27620   gnl|BL_ORD_ID|0 95.65   46      2       0       1       46
35296   gnl|BL_ORD_ID|0 90.91   44      4       0       3       46
35973   gnl|BL_ORD_ID|0 100.00  45      0       0       1       45
41219   gnl|BL_ORD_ID|0 100.00  27      0       0       1       27
46914   gnl|BL_ORD_ID|0 100.00  45      0       0       1       45 

and a file 1.fasta with sequence information like this:

>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG

I am looking for a script that takes the first column of 1.blast, extracts the sequences with those IDs (the first column, $1) from 1.fasta, and then removes from each sequence the positions between $7 and $8. For the first two matches, the output would be:

>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...

(Note that the first three entries of >1 are not in this sequence.)

The IDs are consecutive, meaning I can extract the required information like this:

awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast

This gives me a matrix that contains, in the first column, the row number of the sequence identifier; in the second column, the row number of the sequence itself (one after the ID row); and then the two coordinates of the region to exclude. So the matrix holds all the information needed to decide which elements to extract from 1.fasta.

Unfortunately I do not have much experience with scripting, so I am a bit lost: how do I feed these values into, for example, a suitable sed command? I can get specific rows like this:

sed -n 3,4p 1.fasta

and I can get the string that I want to remove, for example, via

sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'

My problem is how to pipe the information from the first awk call into the other commands, so that they extract the right rows and then remove the given coordinates from the sequence rows. substr isn't the right command for this: I would need something like remstr(string,start,stop) that removes everything between these two positions from a given string, but I think I could write that in a script of my own. The correct piping in particular is the problem for me.

Recommended answer

As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here is a script that can teach you something about how to handle it with awk:

Contents of script.awk:

## Process first file from arguments.
FNR == NR {
        ## Save ID and the range of characters to remove from sequence.
        blast[ $1 ] = $(NF-1) " " $NF
        next
}

## Process second file. For each FASTA id...
$1 ~ /^>/ {
        ## Get number.
        id = substr( $1, 2 )

        ## Read next line (the sequence).
        getline sequence

        ## if the ID is one found in the other file, get ranges and
        ## extract those characters from sequence.
        if ( id in blast ) {
                split( blast[id], ranges )
                sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )
                ## Print both lines with the shortened sequence.
                printf "%s\n%s\n", $0, sequence
        }

}

Assuming the 1.blast from your question and a customized 1.fasta to test it:

>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA 

Run the script like this:

awk -f script.awk 1.blast 1.fasta

That yields:

>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA
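
The pair of substr calls in script.awk is in effect the remstr(string,start,stop) the question asks for. As a minimal sketch, that step could be pulled out into a user-defined awk function. The name remstr comes from the question and is not an awk built-in, and the shell wrapper around it is only there so the function can be tried on its own; positions are assumed to be 1-based and inclusive, as in 1.blast.

```shell
# remstr is a hypothetical helper (name taken from the question), not a built-in.
# Usage: remstr STRING START STOP   (1-based, inclusive positions)
remstr() {
    awk -v s="$1" -v from="$2" -v to="$3" '
        # Keep everything before "start" and everything after "stop".
        function remstr(str, start, stop) {
            return substr(str, 1, start - 1) substr(str, stop + 1)
        }
        BEGIN { print remstr(s, from, to) }
    '
}

# Sequence >1 from the question with positions 1 to 3 removed.
remstr TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG 1 3
# prints ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
```

Inside script.awk the same function body could replace the inline substr concatenation without changing the result.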

Of course I'm assuming some things, the most important being that FASTA sequences are not longer than one line.
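
If the sequences do wrap over several lines, one possible variant is to buffer each record and cut only once the whole sequence has been joined. The following is a sketch under that assumption, not part of the original script: the names cut_wrapped and emit are hypothetical, and it still assumes each ID appears at most once in 1.blast.

```shell
# Hypothetical variant of script.awk for FASTA records whose sequence
# is wrapped over several lines.
# Usage: cut_wrapped 1.blast 1.fasta
cut_wrapped() {
    awk '
        # Print the buffered record, minus the stored range, if its ID is wanted.
        # "r" is an extra parameter so that it acts as a local array.
        function emit(    r) {
            if (id in blast) {
                split(blast[id], r, " ")
                print header
                print substr(seq, 1, r[1] - 1) substr(seq, r[2] + 1)
            }
        }
        # First file: remember the range to remove for each ID.
        FNR == NR { blast[$1] = $(NF-1) " " $NF; next }
        # Header line: emit the previous record, then start a new one.
        /^>/ {
            if (have) emit()
            header = $0; id = substr($1, 2); seq = ""; have = 1
            next
        }
        # Any other line is one piece of the current sequence.
        { seq = seq $0 }
        # Do not forget the last record in the file.
        END { if (have) emit() }
    ' "$1" "$2"
}
```

Because the coordinates are applied to the joined sequence, the output always has the sequence on a single line, whatever the line width of the input.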
