How to match multiple patterns and print a different number of lines after each one using awk
Question
I have a big file with thousands of lines that looks like this:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00002235.4
TTACGCAT
TAGGCCAG
>ENST00005546.9
TTTATCGC
TTAGGGTAT
I want to grep specific IDs (the text after the > sign), for example ENST00001234.1, and then get the lines after the match up to the next >, regardless of how many lines that is. I want to grep about 63 IDs this way at once.
If I grep the IDs ENST00001234.1 and ENST00005546.9, the ideal output should be:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT
I tried awk '/ENST00001234.1/ENST00005546.9/{print}', but it did not help.
Recommended answer
You can set > as the record separator:
$ awk -F'\n' -v RS='>' -v ORS= '$1=="ENST00001234.1"{print RS $0}' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
- -F'\n' makes it easier to compare the search term with the first line of each record
- -v RS='>' sets > as the input record separator
- -v ORS= clears the output record separator; otherwise you'll get an extra newline in the output
- $1=="ENST00001234.1" does a string comparison and matches the entire first line; with a regex you would have to escape metacharacters like . and add anchors
- print RS $0 prints > followed by the record content when a match is found
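To illustrate why the string comparison is the safer choice, here is a minimal sketch. The file demo.txt and the second ID ENST00001234x1 are made up for this example and are not part of the original data. An unanchored regex treats . as "any character", so it also picks up the look-alike ID, while == only accepts the exact one.

```shell
# Hypothetical sample file: the second ID differs from the first only where the "." is.
tmpdir=$(mktemp -d)
printf '>ENST00001234.1\nACGT\n>ENST00001234x1\nTTTT\n' > "$tmpdir/demo.txt"

# Unanchored regex match: "." matches any character, so both records are printed.
regex_out=$(awk -F'\n' -v RS='>' -v ORS= '$1 ~ /ENST00001234.1/{print RS $0}' "$tmpdir/demo.txt")

# String comparison: only the record whose first line is exactly the ID is printed.
string_out=$(awk -F'\n' -v RS='>' -v ORS= '$1 == "ENST00001234.1"{print RS $0}' "$tmpdir/demo.txt")

printf 'regex:\n%s\n\nstring:\n%s\n' "$regex_out" "$string_out"
rm -r "$tmpdir"
```

The regex variant prints both records, the string variant only the first one.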
If you want to match more than one search term, put them in a file:
$ cat f1
ENST00001234.1
ENST00005546.9
$ awk 'BEGIN{FS="\n"; ORS=""}
NR==FNR{a[$0]; next}
$1 in a{print RS $0}' f1 RS='>' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT
Here, the contents of f1 are used to build the keys of array a. Once the first file has been read, RS='>' changes the record separator for the second file.
$1 in a checks whether the first line of the record matches a key in array a.
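As an alternative sketch (not the method used in the answer above), the same lookup can be done line by line with a flag, without changing RS. It assumes each header line is exactly > followed by the ID, as in the sample; the names ids and keep are chosen here only for illustration.

```shell
# Recreate the sample files from the question and answer in a scratch directory.
tmpdir=$(mktemp -d)
printf 'ENST00001234.1\nENST00005546.9\n' > "$tmpdir/f1"
printf '>ENST00001234.1\nACGTACGTACGG\nTTACCCAGTACG\nATCGCATTCAGC\n>ENST00002235.4\nTTACGCAT\nTAGGCCAG\n>ENST00005546.9\nTTTATCGC\nTTAGGGTAT\n' > "$tmpdir/ip.txt"

flag_out=$(awk '
    NR==FNR { ids[$0]; next }                   # first file: remember each wanted ID
    /^>/    { keep = (substr($0, 2) in ids) }   # header line: is this ID wanted?
    keep                                        # print lines while the flag is set
' "$tmpdir/f1" "$tmpdir/ip.txt")

printf '%s\n' "$flag_out"
rm -r "$tmpdir"
```

The flag is re-evaluated at every header line, so each wanted ID prints its header and however many sequence lines follow it.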