Simplifying elements of a list/array and then adding incremental identifiers a, b, c, d, etc. to them


Problem description


I'm processing the headers of a .fasta file (a format universally used in genetics/bioinformatics to store DNA/RNA sequence data). Fasta files have header lines starting with a > symbol (giving specific info), followed on the next line by the actual sequence data that the header describes. The sequence data runs until the next \n, after which comes the next header and its respective sequence. For example:

>scaffold1.1_size947603
ACGCTCGATCGTACCAGACTCAGCATGCATGACTGCATGCATGCATGCATCATCTGACTGATG....
>scaffold2.1_size747567.2.603063_605944
AGCTCTGATCGTCGAAATGCGCGCTCGCTAGCTCGATCGATCGATCGATCGACTCAGACCTCA....

and so on...

So, I have a problem with the fasta headers of the genome for the organism I am working with. Unfortunately, the Perl expertise needed to solve this problem seems to be beyond my current skill level :S So I was hoping someone on here could show me how it can be done.

My genome consists of about 25000 fasta headers and their respective sequences. The headers in their current state are giving me a lot of trouble with the sequence aligners I am trying to use, so I have to simplify them significantly. Here is an example of my first few headers:

>scaffold1.1_size947603
>scaffold10.1_size550551
>scaffold100.1_size305125:1-38034
>scaffold100.1_size305125:38147-38987
>scaffold100.1_size305125:38995-44965
>scaffold100.1_size305125:76102-78738
>scaffold100.1_size305125:84171-87568
>scaffold100.1_size305125:87574-89457
>scaffold100.1_size305125:90495-305068
>scaffold1000.1_size94939

Essentially I would like to refine these to look like this:

>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a

Or perhaps even this (but this seems like it would be more complicated):

>scaffold1.1
>scaffold10.1
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1

What I'm doing here is getting rid of all the size data for each scaffold of the genome. For scaffolds that happen to be fragmented, I'd like to denote the fragments with a, b, c, d, etc. There are a few scaffolds with more than 26 fragments, so perhaps I could denote those with x, y, z, A, B, C, D, etc.

I was thinking of doing this with a simple foreach loop and a substitution, similar to this:

#!/usr/bin/perl -w

### Open the files 
$gen = './Hc_genome/haemonchus_V1.fa';
open(FASTAFILE, $gen);
@lines = <FASTAFILE>;
#print @lines; 

###Add an @ symbol to the start of the label
my @refined;
foreach my $lines (@lines){ 
    chomp $lines;
    $lines =~ s/match everything after .1/replace it with a, b, c.. etc/g;
    push @refined, $lines;
}
#print @refined;


###Push the array on to a new fasta file
open FILE3, "> ./Hc_genome/modded_haemonchus_V1.fa" or die "Cannot open output.txt: $!";

foreach (@refined)
{
    print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;  

But I don't know how to build in the added alphabetical labels between the $1 and the \n in the match and replace operator, essentially because I'm not sure how to step sequentially through the alphabet for each fragment of a particular scaffold. (All I could manage is to add an a to the start of each one...)

Please, if you don't mind, let me know how I might achieve this!

Much appreciated!

Andrew

Solution

In Perl, the increment operator ++ has "magical" behaviour with respect to strings. E.g. after my $s = "a"; the statement $s++ changes $s to "b". This goes on until "z", where the increment produces "aa", and so forth.
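To make the string increment concrete, here is a minimal sketch. The helper name next_label is hypothetical and only wraps the ++ operator so the results are easy to inspect:

```perl
use strict;
use warnings;

# Hypothetical helper: apply Perl's string increment to a copy of the
# label and return the result, leaving the caller's value untouched.
sub next_label {
    my ($label) = @_;
    $label++;
    return $label;
}

print next_label("a"),  "\n";   # b
print next_label("z"),  "\n";   # aa
print next_label("Az"), "\n";   # Ba
print next_label("zz"), "\n";   # aaa
```

Note that the label after "z" is "aa", not "A", so scaffolds with more than 26 fragments get two-letter suffixes rather than the x, y, z, A, B scheme suggested in the question.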

The headers of your file appear to be properly sorted, so we can just loop through each header. From the header, we extract the starting part (everything up to and including the .1). If this starting part is the same as the starting part of the previous header, we increment our sequence identifier. Otherwise, we set it to "a":

use strict; use warnings;  # start every script with these

my $index = "a";
my $prev = "";

# iterate over all lines (rather than reading all 25E3 into memory at once)
while (<>) {

  # pass through non-header lines
  unless (/^>/) {
    print;  # comment this line to remove non-header lines
    next;
  }

  s/\.1\K.*//s;  # remove everything after ".1". Implies chomping

  # reset or increment $index
  if ($_ eq $prev) {
    $index++;
  } else {
    $index = "a";
  }

  # update the previous line
  $prev = $_;

  # output new header
  print "$_$index\n";
}

Usage: $ perl script.pl <./Hc_genome/haemonchus_V1.fa >./Hc_genome/modded_haemonchus_V1.fa

It is considered good style to write programs that accept input from STDIN and write to STDOUT, as this improves flexibility. Rather than hardcoding paths in your Perl script, keep the script general and use shell redirection operators like < and > to specify the input and output. This also saves you the hassle of manually opening the files.

Example Output:

>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a
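
The question also mentions a second format, in which only fragmented scaffolds receive a letter. The script above cannot produce it in a single streaming pass, because it does not know whether a scaffold is fragmented until it has seen the following header. One possible approach, sketched below under the assumption that the whole file fits in memory, is to count the headers per scaffold first and label them in a second pass. The names scaffold_prefix and relabel_fragments are hypothetical, and \K requires Perl 5.10 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: truncate a header after the first ".1", e.g.
# ">scaffold100.1_size305125:1-38034" becomes ">scaffold100.1".
sub scaffold_prefix {
    my ($header) = @_;
    chomp $header;
    $header =~ s/\.1\K.*//s;
    return $header;
}

# Takes all lines of a FASTA file and returns them with simplified headers.
# Only scaffolds that occur more than once receive a letter suffix.
sub relabel_fragments {
    my @lines = @_;

    # First pass: count how many headers share each prefix.
    my %count;
    for my $line (@lines) {
        $count{ scaffold_prefix($line) }++ if $line =~ /^>/;
    }

    # Second pass: rewrite headers, passing sequence lines through untouched.
    my (%label, @out);
    for my $line (@lines) {
        if ($line !~ /^>/) {
            push @out, $line;
            next;
        }
        my $prefix = scaffold_prefix($line);
        if ($count{$prefix} > 1) {
            # Start at "a", then rely on the string increment.
            if (exists $label{$prefix}) { $label{$prefix}++ }
            else                        { $label{$prefix} = "a" }
            push @out, "$prefix$label{$prefix}\n";
        }
        else {
            push @out, "$prefix\n";
        }
    }
    return @out;
}

# Small demonstration with made-up sequence lines.
print relabel_fragments(
    ">scaffold10.1_size550551\n",
    "ACGT\n",
    ">scaffold100.1_size305125:1-38034\n",
    "GGCC\n",
    ">scaffold100.1_size305125:38147-38987\n",
    "TTAA\n",
);
```

Because the labels are tracked per scaffold in a hash, this variant does not depend on fragments of the same scaffold being adjacent in the file. The trade-off is that all lines are held in memory, unlike the streaming loop above.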
