Removing PCR Duplicates From Fastq File Containing Unique Molecular Identifiers

Problem Description

I am trying to edit a Fastq file containing genomic data and Unique Molecular Identifiers flanking each sequence.

An example of the first two reads is shown below:

1 @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
2 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
3 +
4 BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B
5 @HISEQ:230:C6G45ANXX:3:1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
6 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
7 +
8 BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF##################################

These lines are explained below:

1 Information
2 Sequence
3 +
4 Quality Scoring
5 Information
6 Sequence
7 +
8 Quality Scoring

I need an output file in which all exact repeats of a given sequence (and its corresponding information) have been removed. That is, I need to remove every block of 4 lines whose 2nd line has already appeared earlier in the file.

So, in the above example, because the sequences in lines 2 and 6 match, the output file should contain lines 1, 2, 3, and 4 but not 5, 6, 7, and 8.

Resulting output file:

1 @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
2 TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
3 +
4 BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B

Solution

This seems to be the perfect case in which we loop through the file twice: first to detect the duplicates and then to print the appropriate lines:

awk 'FNR==NR {                                   # first pass: detect duplicate blocks
          if (FNR%4==2) {                        # sequence lines sit at positions 4K+2
              a[$1]++
              if (a[$1]>1) b[int((FNR-1)/4)]=1   # mark block K as a duplicate
             }
          next}
      # second pass: print only the blocks that were not marked
      b[int((FNR-1)/4)]==0' file file

The key here is to work with the sequence lines, which sit at positions 4K+2 in the file, and to keep track of which sequences have already been seen. When a sequence repeats, we mark its block index K (computed as int((FNR-1)/4)) so that on the second pass over the file we skip that whole 4-line block, i.e. lines 4K+1 through 4K+4.

For clarity I assumed the line numbers shown in the first column are not actually present in the file (I didn't know whether they were added for illustration or are really there), so the script keys on $1. If the numbers really are in the file, use $2 instead, or simply strip them first, which should be trivial.
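For instance (a minimal sketch, assuming each number is followed by a single space; file and clean.fastq are placeholder names), the numbers could be stripped like this:

# drop a leading "<number> " prefix from every line, if present
awk '{sub(/^[0-9]+ /, ""); print}' file > clean.fastq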

Test

$ awk 'FNR==NR {if (FNR%4==2) {a[$1]++; if (a[$1]>1) b[int((FNR-1)/4)]=1}; next} b[int((FNR-1)/4)]==0' a a
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACACAACATACATACAACACCTGGATCTGTGAATTAATTACTGCCTAGG
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/77F/B/BF7/FF/<BF/7FFFFBBF//B7B
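
If reading the input twice is inconvenient (for example, when it comes from a pipe), a roughly equivalent single-pass sketch is shown below, again assuming the line numbers are not actually in the file (file is a placeholder name): it buffers each 4-line record and prints it only the first time its sequence line appears.

awk '{
    rec[FNR % 4] = $0                    # buffer the current line of the 4-line record
    if (FNR % 4 == 0) {                  # the record (lines 4K+1..4K+4) is complete
        if (!seen[rec[2]]++)             # rec[2] holds the sequence line; keep only its first occurrence
            print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }
}' file

Memory use is comparable to the two-pass version (one array entry per distinct sequence), at the cost of briefly holding each 4-line record in rec[].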
