awk: extracting data that spans several lines


Problem description

I have a file that looks like this:

/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
                 LITPRAAVPALKRPALKASLPASSSHGNWETF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             complement(471..590)
                 /db_xref="SEED:fig|1240086.14.peg.2"
                 /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                 /product="hypothetical protein"
 CDS             717..2354
                 /db_xref="SEED:fig|1240086.14.peg.3"
                 /translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
                 QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
                 RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
                 /product="macromolecule metabolism; macromolecule
                 degradation; degradation of proteins, peptides,
                 glycopeptides"

I need to extract the text between the quotes after each "/product=", so the output should be:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

I have to use awk, so I wrote this:

awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'

But this only captures the text on the same line as "/product", and sometimes the value spans two or three lines. I'm out of ideas for how to get the whole text between the quotes. Can anyone help?

Solution

awk to the rescue! This needs an awk that supports a multi-character RS (such as gawk).

$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file


Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

Explanation: set the record separator so that each record starts right after either "/" or " CDS", select the records that begin with "product", collapse each line break and the indentation that follows it into a single space, and print the text between the two quotes (the second field, since the field separator is set to the double quote).
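If a multi-character RS is not available, the same result can probably be reached line by line. The following is only a sketch under stated assumptions, not part of the original answer: it assumes every value opens with /product=" and that the closing quote is the last non-blank character of the value's final line. The variable names (buf, collecting, sample, result) and the shortened sample data are made up for illustration.

```shell
# Recreate a small sample in a temporary file (hypothetical data modeled on the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
 CDS             complement(471..590)
                 /db_xref="SEED:fig|1240086.14.peg.2"
                 /product="hypothetical protein"
 CDS             717..2354
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
EOF

result=$(awk '
# A line containing /product=" starts a new value: keep what follows the opening quote.
/\/product="/ {
    buf = $0
    sub(/^.*\/product="/, "", buf)
    collecting = 1
}
# A continuation line: strip its indentation and append it with a single space.
collecting && !/\/product="/ {
    line = $0
    sub(/^[ \t]+/, "", line)
    buf = buf " " line
}
# The value is complete once the buffer ends with the closing quote.
collecting && buf ~ /"[ \t]*$/ {
    sub(/"[ \t]*$/, "", buf)
    print buf
    collecting = 0
}
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because this version never splits on "/", a slash inside a product description would not cut the value short, at the cost of a few more lines than the RS-based one-liner.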
