Extract columns with values matching a specific pattern

Views: 62 | Published: 2020/9/15 8:24:54 | Tags: linux r sed awk


Question

I have a multi-column GTF file in which each row has a different number of columns:

chr1    Cufflinks   exon    12659   12721   .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS1";
chr1    Cufflinks   exon    13221   16604   .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS1";
chr1    Cufflinks   exon    29554   30039   .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "MIR1302-11"; oId "ENST00000473358"; nearest_ref "ENST00000473358"; class_code "="; tss_id "TSS2";
chr1    Cufflinks   exon    30564   30667   .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "MIR1302-11"; oId "ENST00000473358"; nearest_ref "ENST00000473358"; class_code "="; tss_id "TSS2";
chr1    Cufflinks   exon    69091   70008   .   +   .   gene_id "XLOC_000003"; transcript_id "TCONS_00000005"; exon_number "1"; gene_name "OR4F5"; oId "ENST00000335137"; nearest_ref "ENST00000335137"; class_code "="; tss_id "TSS4"; p_id "P1";

I only want the columns matching the patterns 'gene_id "...";', 'transcript_id "...";' and 'class_code "...";'.

I tried removing the unwanted columns using:

sed -e 's/nearest_ref\s\"[A-Z]\{4\}[0-9]\{11\}\"\;//' -e 's/oId\s\"[A-Z|\.|0-9]*\"\;//' -e 's/gene_name\s\"[A-Z|0-9|\.|\-]*\"\;//' -e 's/contained_in\s\"[A-Z|\_|0-9]*\"\;//' -e 's/p_id*\s\".*\"\;//' merged.gtf > temp.gtf

But it looks like the file contains many other unwanted columns that I cannot see (the file is huge). How do I extract the desired columns and save them to another file?

Recommended answer

If you don't mind an extra trailing space, and the assumptions in my comment above hold, then the following should work:

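awk '{
    # Scan every whitespace-separated field on the line.
    for (i = 1; i <= NF; i++) {
        # When a field holds one of the wanted keys, print the key
        # followed by the next field (its quoted value).
        if ($i ~ /gene_id|transcript_id|class_code/) {
            printf "%s %s ", $i, $(i + 1)
        }
    }
    # End the output line (a trailing space remains).
    print ""
}' merged.gtf > temp.gtf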
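
If the trailing space is a problem, the same loop can build the output line with an explicit separator instead. The following is only a sketch under the same assumptions as above: each attribute is a key followed by a single whitespace-free quoted value, as in the sample rows. The function name extract_attrs is made up for illustration, and the keys are compared exactly rather than by regex so that a value that happens to contain, say, "gene_id" is not picked up by mistake.

```shell
# Hypothetical helper (not from the original answer): print only the
# gene_id, transcript_id and class_code key/value pairs of each line.
# Assumes every value is a single field, e.g. "XLOC_000001";
extract_attrs() {
    awk '{
        out = ""
        sep = ""
        # Stop at NF - 1 so that $(i + 1) always exists.
        for (i = 1; i < NF; i++) {
            if ($i == "gene_id" || $i == "transcript_id" || $i == "class_code") {
                # Put the separator before each pair except the first,
                # so the line does not end with a space.
                out = out sep $i " " $(i + 1)
                sep = " "
            }
        }
        print out
    }' "$@"
}
```

It would be called the same way as the command above, for example: extract_attrs merged.gtf > temp.gtf. Like the original, it drops the first eight positional columns and keeps only the selected attributes.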
