awk: extracting data that spans several lines
I have a file that looks like this:
/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
LITPRAAVPALKRPALKASLPASSSHGNWETF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
CDS complement(471..590)
/db_xref="SEED:fig|1240086.14.peg.2"
/translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
/product="hypothetical protein"
CDS 717..2354
/db_xref="SEED:fig|1240086.14.peg.3"
/translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
/product="macromolecule metabolism; macromolecule
degradation; degradation of proteins, peptides,
glycopeptides"
I need to extract the text between the quotes that follow "/product=", so the output should be:
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
I have to use awk, so I wrote this:
awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'
But this only captures the text on the same line as "/product", and sometimes the value spans two or three lines. I'm out of ideas for how to get the entire text between the quotes. Can anyone help?
awk to the rescue! This needs multi-character RS support (gawk).
$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
Explanation: set the record separator so that each record starts at a "/" or at " CDS", select the records that start with product, collapse each line break plus the indentation that follows it into a single space, and print the text between the two quotes (the second field, because the field separator is set to the double quote).
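If gawk is not available, the same result can probably be obtained without a multi-character RS by accumulating lines until the closing quote appears. The sketch below is not the method from the answer above but a portable alternative. It assumes the input keeps the usual GenBank-style indentation (the indentation is not visible in the sample as posted), that a product value never contains a double quote, and that each value's closing quote is the last non-blank character on its line. The temporary sample file and the function name extract_products are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample: a subset of the data from the question, with
# GenBank-style indentation assumed.
sample=$(mktemp)
cat > "$sample" <<'EOF'
     CDS             complement(471..590)
                     /db_xref="SEED:fig|1240086.14.peg.2"
                     /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                     /product="hypothetical protein"
     CDS             717..2354
                     /product="Methyl-accepting chemotaxis protein I (serine
                     chemoreceptor protein)"
                     /product="macromolecule metabolism; macromolecule
                     degradation; degradation of proteins, peptides,
                     glycopeptides"
EOF

# extract_products FILE: print each /product value on a single line.
extract_products() {
    awk '
        # Start of a /product qualifier: keep only the text after the opening quote.
        !collecting && /\/product="/ {
            sub(/^.*\/product="/, "")
            buf = ""
            collecting = 1
        }
        # While inside a value, append each line until the closing quote is seen.
        collecting {
            sub(/^[ \t]+/, "")                  # drop continuation-line indentation
            buf = (buf == "" ? $0 : buf " " $0) # join the pieces with one space
            if (buf ~ /"[ \t]*$/) {             # closing quote reached
                sub(/"[ \t]*$/, "", buf)
                print buf
                collecting = 0
            }
        }
    ' "$1"
}

result=$(extract_products "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

With the sample above, this prints three lines: hypothetical protein, then the Methyl-accepting chemotaxis protein I line, then the macromolecule metabolism line, each joined onto a single line. The trade-off is a little more state (the collecting flag and buf) in exchange for working in awk implementations that use only the first character of RS.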