sed: how to remove a line matching a pattern, together with the lines around it?
I have a file from which I want to delete a line matching a pattern, along with the lines directly above and below it.
For example:
FFFFIFIBBFFFFFFFFFFFFFBBBBFBBBBFBBBB77<<BBBBBB7B<BBBBBB<B<
@HISEQ:102:h9u5badxx:1:1101:13002:2147 1:N:0:CTGT
GATCCCCGTCTATCAGATACACGTTACTCAGCTAGTGCGAATGCGAACGCGAAATTTT
+
FFFFFFFFBBFFFFFFFFFFFFFBFBFFFFFFFFFBFFFBFFFFFBFFFFFFFFFBFB
@HISEQ:102:h9u5badxx:1:1101:15368:2194 1:N:0:CTGT
+
FFIFBFFIFFBBBFFFFFFFBBFFBFFBBBFFFBB7BBBBBBFFFBB700<7770<BBB0<0<BFFBFBFFFFF
@HISEQ:102:h9u5badxx:1:1101:19167:2169 1:N:0:CTGT
GATCTCATATAGGGCAGCGTGGTCGCGGC
I want to remove the second block, which does not contain a nucleotide sequence.
The expected result:
FFFFIFIBBFFFFFFFFFFFFFBBBBFBBBBFBBBB77<<BBBBBB7B<BBBBBB<B<
@HISEQ:102:h9u5badxx:1:1101:13002:2147 1:N:0:CTGT
GATCCCCGTCTATCAGATACACGTTACTCAGCTAGTGCGAATGCGAACGCGAAATTTT
+
FFIFBFFIFFBBBFFFFFFFBBFFBFFBBBFFFBB7BBBBBBFFFBB700<7770<BBB0<0<BFFBFBFFFFF
@HISEQ:102:h9u5badxx:1:1101:19167:2169 1:N:0:CTGT
GATCTCATATAGGGCAGCGTGGTCGCGGC
This pattern matches the block:
'^.+$(\n)^(@HISEQ).*$(\n)^\+'
It works in Perl and JavaScript, but not in sed, because sed processes its input one line at a time and never sees the line breaks.
I found this workaround:
sed -e ':a;N;$!ba;s/\n/ /' test
But this code replaces line breaks with spaces. If I insert my regexp into it:
sed -e ':a;N;$!ba;/^.+$(\n)^(@HISEQ).*$(\n)^\+/d' test
it does not work. Can you help me find a solution to this problem?
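A note on why this attempt cannot work, with a sketch of a slurp-based variant that does. After `:a;N;$!ba` the pattern space holds the entire file, and `d` deletes the entire pattern space, so the command can only drop everything or nothing. In addition, sed uses basic regular expressions by default, where an unescaped `+`, `(` and `)` are literal characters, and `^` and `$` do not match at embedded newlines. The sketch below assumes GNU sed (for `-E` and for `\n` inside a bracket expression) and uses `s///` to cut out only the unwanted lines. The function name `drop_block_slurp` is made up for illustration.

```shell
# Hypothetical helper; assumes GNU sed.
# Reads standard input, or the files given as arguments.
drop_block_slurp() {
    # :a;N;$!ba  slurp the whole input into the pattern space
    # s/.../ /g  cut out "line above, @HISEQ header, + line" wherever a
    #            header is directly followed by a line starting with +
    sed -E ':a;N;$!ba;s/\n[^\n]+\n@HISEQ[^\n]*\n\+//g' "$@"
}
```

It would be called as `drop_block_slurp test`. Two limitations follow from the approach: the whole file is held in memory, and a block at the very start of the file is not matched, because the pattern requires a newline before the line above the header.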
I'm just stupid. I misunderstood the file format. Input:
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA
+
JJJHIIJFIJJJJ=BFFFFFEEEEEEDDDDDDDDDDBD
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
How should I edit the regular expression to get the desired output?
Output:
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
If I understand you correctly, then
sed ':loop; N; /\n+/ ! { $ ! b loop }; /\n@HISEQ[^\n]\+\n+/ d' foo.txt
will work. It works as follows:
:loop # in a loop
N # fetch more lines
/\n+/ ! { $ ! b loop } # until one starts with + or is the last line
/\n@HISEQ[^\n]\+\n+/ d # if the penultimate line of all that begins with @HISEQ,
# discard the lot.
The last pattern relies on the fact that it is checked right after the first line beginning with + is found, so the \n+ at its end can only match the start of the last line in the block.
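For the updated input in the question, where each record starts with its `@` header and an empty record consists only of header, `+` and quality line, the pattern above does not apply directly. One possible adaptation, offered as a sketch and not as part of the original answer, walks the file record by record instead of searching for a particular header text. It assumes that every record is either a complete four-line record (header, sequence, `+`, quality) or a three-line record with the sequence missing, and that sequence and quality each occupy a single line. The function name `drop_empty_fastq_records` is made up for illustration.

```shell
# Hypothetical helper. Assumes each record is either 4 lines
# (header, sequence, +, quality) or 3 lines (header, +, quality).
drop_empty_fastq_records() {
    # N        append the line that follows the header
    # /\n+$/   if that line is exactly "+", the sequence is missing:
    #   N; d   pull in the quality line as well and discard the record
    # N; N     otherwise append the "+" and quality lines; the complete
    #          record is printed when the cycle ends
    sed 'N; /\n+$/{ N; d; }; N; N' "$@"
}
```

It would be called as `drop_empty_fastq_records foo.txt`. Because records are consumed by position, a quality line that happens to start with `@` or `+` is not mistaken for a header or a separator.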