删除重复PCR从FASTQ文件包含独特的分子标识符 [英] Removing PCR Duplicates From Fastq File Containing Unique Molecular Identifiers
问题描述
我想编辑包含基因组数据和独特的分子标识每个侧翼序列的FASTQ文件。
前两个的例子读取显示如下:
1 @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
2 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
3 +
4 BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B
5 @HISEQ:230:C6G45ANXX:3 1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
6 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
7 +
8 BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF##################################
下面这些线路进行了说明:
1信息
2序列
3 +
4质量评分
5信息
6序列
7 +
8质量评分
我需要,其中给定的序列(及其相应的信息)的所有确切的重复已被删除的输出文件。也就是说,我需要删除4条线路,其中:第二个已经出现在文件中的块。
因此,在上面的例子中,因为该序列中的行2和6相匹配的输出文件应包含行1,2,3,和4,但不5,6,7和8。
生成的输出文件:
1 @HISEQ:230:C6G45ANXX:3 1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
2 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
3 +
4 BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B
这似乎是我们通过文件循环两次完美的情况:首先是计算重复,然后再打印此时,相应的行:
的awk'FNR == {NR
如果(FNR%4 == 2){
一[$ 2] ++
如果(一个[$ 2]→1)B〔INT(FNR / 4)] = 1
}
下一个}
B〔INT(FNR / 4)] == 0'file文件
这里的关键是与4K + 2行文件中发挥和跟踪哪些那些迄今已出现了。如果他们这样做,我们店的 K
(从 4K + 2
),以便在文件的下一个循环中,我们避开那些被表单上线 4K + 0/1/2/3
。
为了清楚我以为在第一列中的行不存在(我不知道,如果他们加入到澄清还是真的存在)。删除它们应该是微不足道的。
测试
$ AWK'FNR == {NR如果(FNR%4 == 2){a [$ 2] ++;如果(A [$ 2]→1)B〔INT(FNR / 4)] = 1}}旁边B〔INT(FNR / 4)] == 0'一
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
+
BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF##################################
I am trying to edit a Fastq file containing genomic data and Unique Molecular Identifiers flanking each sequence.
An example of the first two reads are shown below:
1 @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
2 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
3 +
4 BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B
5 @HISEQ:230:C6G45ANXX:3:1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
6 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
7 +
8 BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF##################################
These lines are explained below:
1 Information
2 Sequence
3 +
4 Quality Scoring
5 Information
6 Sequence
7 +
8 Quality Scoring
I need an output file in which all exact repeats of a given sequence (and its corresponding information) have been removed. That is, I need to remove those blocks of 4 lines in which the 2nd one have already appeared in the file.
So that in the above example because the sequence matches in lines 2 and 6 the output file should contain lines 1,2,3, and 4 but not 5,6,7, and 8.
Resulting output file:
1 @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
2 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
3 +
4 BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B
This seems to be the perfect case in which we loop through the file twice: firstly to calculate duplicates and then to print the appropiate lines:
awk 'FNR==NR {
if (FNR%4==2) {
a[$2]++
if (a[$2]>1) b[int(FNR/4)]=1
}
next}
b[int(FNR/4)]==0' file file
The key here is to play with the 4K+2 lines in the file and keep track of which ones have appeared so far. If they do, we store the K
(from 4K+2
) so that in the next loop of the file we avoid those lines being on the form 4K+0/1/2/3
.
For clarity I assumed the lines in the first column are not there (I didn't know if they were added to clarify or are really there). Removing them should be trivial.
Test
$ awk 'FNR==NR {if (FNR%4==2) {a[$2]++; if (a[$2]>1) b[int(FNR/4)]=1} next} b[int(FNR/4)]==0' a a
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
+
BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF##################################
这篇关于删除重复PCR从FASTQ文件包含独特的分子标识符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!