awk matching patterns from row based file and output as CSV


Question

I have a file with records in this type of format:

LOCUS       NG_029783              19834 bp    DNA     linear   PRI 03-OCT-2014 DEFINITION  Homo sapiens long intergenic non-protein coding RNA 1546
            (LINC01546), RefSeqGene on chromosome X. ACCESSION   NG_029783 VERSION     NG_029783.1 KEYWORDS    RefSeq; RefSeqGene. SOURCE      Homo sapiens (human)   ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo. COMMENT     VALIDATED REFSEQ: This record has undergone validation or
            preliminary review. The reference sequence was derived from
            AC004616.1.
            This sequence is a reference standard in the RefSeqGene project. PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-19834             AC004616.1         8636-28469 FEATURES             Location/Qualifiers
     source          1..19834
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /chromosome="X"
                     /map="Xp22.33"
     variation       4
                     /replace="c"
                     /replace="t"
                     /db_xref="dbSNP:1205550"
     variation       17
                     /replace="c"
                     /replace="t"
                     /db_xref="dbSNP:1205551"
     gene            5001..5948
                     /gene="OR6K3"
                     /gene_synonym="OR1-18"
                     /note="olfactory receptor family 6 subfamily K member 3"
                     /db_xref="GeneID:391114"
                     /db_xref="HGNC:HGNC:15030"
     mRNA            5001..5948
                     /gene="OR6K3"
                     /gene_synonym="OR1-18"
                     /product="olfactory receptor family 6 subfamily K member

// 
LOCUS       NG_032962              70171 bp    DNA     linear   PRI 17-JUN-2016 DEFINITION  Homo sapiens death domain containing 1 (DTHD1), RefSeqGene on
                chromosome 4. ACCESSION   NG_032962 VERSION     NG_032962.1 KEYWORDS    RefSeq; RefSeqGene. SOURCE      Homo sapiens (human)   ORGANISM  Homo sapiens
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
                Catarrhini; Hominidae; Homo. COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
                reference sequence was derived from AC104078.3.
                This sequence is a reference standard in the RefSeqGene project.

        Summary: This gene encodes a protein which contains a death domain.
        Death domain-containing proteins function in signaling pathways and
        formation of signaling complexes, as well as the apoptosis pathway.
        Alternative splicing results in multiple transcript variants.
        [provided by RefSeq, Oct 2012]. PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
        1-70171             AC104078.3         59395-129565 FEATURES             Location/Qualifiers
 source          1..70171
                 /organism="Homo sapiens"
                 /mol_type="genomic DNA"
                 /db_xref="taxon:9606"
                 /chromosome="4"
                 /map="4p14"
 gene            5001..129091
                 /gene="REEP1"
                 /gene_synonym="C2orf23; HMN5B; SPG31; Yip2a"
                 /note="receptor accessory protein 1"
                 /db_xref="GeneID:65055"
                 /db_xref="HGNC:HGNC:25786"
                 /db_xref="MIM:609139"
 mRNA            join(5001..5060,60842..60914,79043..79119,88270..88390,
                 91014..91127,110282..110459,125974..129091)
                 /gene="REEP1"
                 /gene_synonym="C2orf23; HMN5B; SPG31; Yip2a"
                 /product="receptor accessory protein 1, transcript variant
                 1"
                 /transcript_id="NM_001164730.1"

I've been using this workflow:

  1. Remove extra spaces: gawk '{$1=$1}1' raw_file.text > temp_file.txt

  2. Match the "Summary" content: gawk '/Summary/,/\]/{print}' temp_file.txt > summary_temp.txt

  3. Remove newlines: gawk 'BEGIN {RS=""}{gsub(/\n/,"",$0); print $0}' summary_temp.txt > summary.txt

I have a couple of questions. First, how can I combine those three steps? Second, how can I also select one or more additional matches, for example '/gene="AP3B2"' (which requires matching the first instance of "/gene" after "gene"), so that I can output the contents in this form:

gene,summary

Recommended answer

$ cat tst.awk
BEGIN{RS="//"}
{
  match($0, /\/gene="([^"]+)"/, a)
  print a[1] ", ", 
        gensub(/\s\s+/, "", "g", gensub(/.*Summary:\s([^\[]+).*/, "\\1", "g"))
}

Explanation:

match($0, /\/gene="([^"]+)"/, a)

finds the first occurrence of /gene="..." in the record and stores it in the array a: a[0] holds the whole match and a[1] holds the text between the quotes. According to your question only the first occurrence is needed, which is a[1] (and in your sample that is not AP3B2, by the way).
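
The three-argument form of match() is a gawk extension. As a minimal sketch for an awk that lacks it, the same first-occurrence lookup can be done with RSTART and RLENGTH, which match() sets for the leftmost match only. The sample record and the variable name first_gene are made up for illustration.

```shell
# Hypothetical record with two /gene= qualifiers; only the first should be reported.
record='gene 5001..5948 /gene="OR6K3" mRNA 5001..5948 /gene="OTHER"'

first_gene=$(printf '%s\n' "$record" | awk '
{
  # match() sets RSTART and RLENGTH for the leftmost match only
  if (match($0, /\/gene="[^"]+"/)) {
    # skip the 7 characters of /gene=" and drop the closing quote
    print substr($0, RSTART + 7, RLENGTH - 8)
  }
}')

echo "$first_gene"
```

This prints OR6K3. The second qualifier is never looked at, which is the behaviour the question asks for.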

gensub(/.*Summary:\s([^\[]+).*/, "\\1", "g")

catches everything after "Summary: " up to the first "[". This result still contains runs of spaces and newlines. Let's get rid of them:

gensub(/\s\s+/, "", "g", <<result of 1st gensub>>)
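
The replacement string matters here, because a line break plus indentation is the only thing separating the last word of one line from the first word of the next. The sketch below compares an empty replacement with a single space. The sample text and the squeeze helper are hypothetical, and gsub() with an explicit character class stands in for gawk's gensub() and \s so that it runs in any awk.

```shell
# Hypothetical two-line excerpt with the indentation found in the records.
text=$(printf 'a death domain.\n        Death domain-containing proteins')

squeeze() {
  # $1 is the replacement for each run of two or more whitespace characters
  printf '%s\n' "$text" | awk -v rep="$1" '
    BEGIN { RS = "" }    # paragraph mode: read the whole excerpt as one record
    { gsub(/[ \t\n][ \t\n]+/, rep); print }'
}

glued=$(squeeze "")
spaced=$(squeeze " ")
echo "$glued"
echo "$spaced"
```

The first line printed is "a death domain.Death domain-containing proteins" and the second is "a death domain. Death domain-containing proteins", so a single space is the safer replacement when words must stay separated.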

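Not every record contains a "Summary"

For a record without one, the first gensub() finds nothing to replace and returns the whole record unchanged, so the entire record would be printed. To avoid that, capture the summary with match() as well. Change the script to: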

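$ cat tst.awk
BEGIN{RS="//"}    # records are separated by "//"
{
  # a[1]: value of the first /gene="..." qualifier in the record
  match($0, /\/gene="([^"]+)"/, a)
  # b[1]: text after "Summary: " up to the first "["; empty when there is no summary
  match($0, /Summary:\s([^\[]+)/, b)
  # replace each run of two or more whitespace characters (newlines included) with one space
  print a[1] ",",
    gensub(/\s\s+/, " ", "g", b[1])
}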

Running the script with the input provided by the OP:

$ awk -f tst.awk tst.txt
OR6K3,
REEP1,  This gene encodes a protein which contains a death domain.Death 
domain-containing proteins function in signaling pathways andformation 
of signaling complexes, as well as the apoptosis pathway.Alternative 
splicing results in multiple transcript variants.
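
One caveat for CSV consumers: the summary itself contains a comma ("complexes, as well as"), so a reader that splits on commas would see more than two columns. A common convention is to wrap each field in double quotes and double any embedded quote. The following is a minimal sketch of that idea, not part of the original answer. The csvq helper, the shortened record and the variable names are hypothetical, and it uses index() and RSTART/RLENGTH instead of the gawk extensions so that it runs in any awk.

```shell
# Hypothetical record, shortened from the second sample in the question.
csv=$(printf '%s\n' \
  'Summary: Signaling complexes, as well as the' \
  '        apoptosis pathway. [provided by RefSeq]' \
  ' gene 5001..129091 /gene="REEP1"' | awk '
function csvq(s) {                 # hypothetical helper: quote one CSV field
  gsub(/"/, "\"\"", s)             # double any embedded quotes
  return "\"" s "\""
}
BEGIN { RS = "//" }
NF == 0 { next }                   # skip a whitespace-only chunk after a final //
{
  gene = ""; summary = ""
  if (match($0, /\/gene="[^"]+"/))
    gene = substr($0, RSTART + 7, RLENGTH - 8)
  s = index($0, "Summary:")
  if (s) {
    summary = substr($0, s + 8)                   # text after "Summary:"
    e = index(summary, "[")
    if (e) summary = substr(summary, 1, e - 1)    # stop at the first "["
  }
  gsub(/[ \t\n]+/, " ", summary)   # collapse whitespace runs to one space
  gsub(/^ | $/, "", summary)       # trim leading and trailing space
  print csvq(gene) "," csvq(summary)
}')

echo "$csv"
```

For this input the output is "REEP1","Signaling complexes, as well as the apoptosis pathway." with both fields quoted, so the comma inside the summary no longer splits the column.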
