在 Perl 中用更少的代码实现这个算法 [英] Implementing This Algorithm with Less line of Code at Perl

查看:76
本文介绍了在 Perl 中用更少的代码实现这个算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在 Perl 中实现这个算法.让我们接受:

I want to implement this algorithm in Perl. Let's accept that:

  • DNA1 = GACTAGGC
  • DNA2 = AGCTAGGA

DNA1 的第一个元素是 G,我们会找出 DNA2 处是否有 G 并用一个点指向它.我们继续它直到最后,所以图像将每个相同的元素交叉点显示为一个点.

First element of DNA1 is G, we will find if there is G at DNA2 and point it with a dot. We continue it till the end so the image shows evey same element intersections as a dot.

下一步是:连接点.要指向第一个点,一个应该在一个小方块的左上角,另一个在右下角(我的意思是线条应该有 135 度)如果严格度是 2,这意味着拒绝从 2 和少于 2 个点(这意味着如果严格性为 3,则图像中只有一行).

The next step is that: connecting points. To point to dots first one should be at left top corner of a small square and the other one at right bottom(I mean lines should have a 135 degree) If stringency is 2, it means that reject the lines which has occured from 2 and less then 2 dots(It means that if stringency was 3, there would be just one line at image).

最后一步是:字数统计.如果 wordcount 为 1(在图像处为 1),则表示将元素一一比较.如果是 3,则表示将其中的 3 个进行比较.您可以编写一个 wordcount 为 1 的程序,因为它始终为 1.

Last step is that: wordcount. If wordcount is 1(it is one at image) it means that compare elements one by one. If it was 3 it means that compare 3 of them together. You can write a program that wordcount is 1 because it is always 1.

我搜索了一下,这就是我所拥有的:

I searched about it and this is what I have:

$infile1 = "DNA1.txt";
$infile2 = "DNA2.txt";

$outfile = "plot.txt";
$wordsize=0;
$stringency=0;

open inf, $infile1 or die "STOP! File $infile1 not found.\n";
$sequence1=<inf>;
chomp $sequence1;
@seq1=split //,$sequence1;
close inf;

open inf, $infile2 or die "STOP! File $infile2 not found.\n";
$sequence2=<inf>;
chomp $sequence2;
@seq2=split //,$sequence2;
close inf;

$Lseq1=$#seq1+1;
$Lseq2=$#seq2+1;

open ouf, ">$outfile";

for ($i=0;$i<$Lseq1;$i++){
print ouf "\n";
for ($j=0;$j<$Lseq2;$j++){
  $match=0;
  for ($w=0;$w<=$wordsize;$w++){
    if($seq1[$i+$w] eq $seq2[$j+$w]){
      $match++;
    }
  }
  if($match > $stringency){
     print ouf "1";
  }
  else{
     print ouf "0";
  }
}
}

你能检查一下错误吗?我怎样才能用更少的 Perl 代码优化我的代码?

Can you check it about errors and how can I optimize my code with more less code at Perl?

PS:我认为每次都接受 $wordsize 等于 $stringency 是可以的.

PS: I think it is OK to accept $wordsize equals to $stringency every time.

编辑 1:我已经编辑了我的代码,它只用于放置点.

EDIT 1: I have edited my code and it works for just puting dot.

编辑 2: 算法是这样的:

qseq, sseq = sequences
win = number of elements to compare for each point
Strig = number of matches required for a point

for each q in qseq:
  for each s in sseq:
    if CompareWindow(qseq[q:q+win], s[s:s+win], strig):
      AddDot(q, s)

编辑 3: 这是一个更好的算法建议:

EDIT 3: Here is a better algorithm suggestion:

osl.iu.edu/~chemuell/projects/bioinf/dotplot.ppt

欢迎任何根据更好的算法改进代码的想法.

Any idea to improve code according to that better algorithm is welcome.

推荐答案

首先,最里面的 for 循环是完全没有必要的.摆脱它会加速你的代码.

First, the innermost for loop is totally unnecessary. Getting rid of it will speed up your code.

其次,您的代码不适用于 0 以外的 $stringency.

Secondly, your code doesn't work for $stringency other than 0.

修复:

use strict;
use warnings;

my $s1 = 'GACTAGGC';
my $s2 = 'AGCTAGGA';
my $stringency = 0;

my @s1 = split //, $s1;
my @s2 = split //, $s2;
my @L;
for my $i (0..$#s1) {
   for my $j (0..$#s2) {
      if ($s1[$i] ne $s2[$j]) {
         $L[$i][$j] = 0;
      } elsif ($i == 0 || $j == 0) {
         $L[$i][$j] = 1;
      } else {
         $L[$i][$j] = $L[$i-1][$j-1] + 1;
      }

      print $L[$i][$j] <= $stringency ? "0" : "1";
   }
   print("\n");
}

现在我们有了一个有效的算法,我们可以调整实现.

Now that we have an efficient algorithm, we can tweak the implementation.

use strict;
use warnings;

my $s1 = 'GACTAGGC';
my $s2 = 'AGCTAGGA';
my $stringency = 0;

my @s1 = split //, $s1;
my @s2 = split //, $s2;
my @L = (0) x @s2;
for my $i (0..$#s1) {
   for my $j (0..$#s2) {
      if ($s1[$i] eq $s2[$j]) {
         ++$L[$j];
      } else {
         $L[$j] = 0;
      }

      print $L[$j] <= $stringency ? "0" : "1";
   }

   print("\n");
   pop @L;
   unshift @L, 0;
}

如果您想更好地了解正在发生的事情,请替换

If you want a better idea of what's going on, replace

print $L[$j] <= $stringency ? "0" : "1";

print $L[$j];

你会得到类似的东西

01000110
10001002
00100000
00020000
10003001
02000410
01000150
00200000

顺便说一句,我们试图实现的目标与寻找最长公共子串非常相似.

By the way, what are trying to achieve is remarkably similar to finding the longest common substring.

更新从文件中获取$s1$s2,一次一行,

Update To get $s1 and $s2 from files, one line at a time,

open(my $fh1, '<', ...) or die(...);
open(my $fh2, '<', ...) or die(...);

for (;;) {
    my $s1 = <$fh1>;
    my $s2 = <$fh2>;

    die("Files have different length\n")
        if defined($s1) && !defined($s2)
        || !defined($s1) && defined($s2);

    last if !defined(($s1);

    chomp($s1);
    chomp($s2);

    ... code to generate graph ...
}

这篇关于在 Perl 中用更少的代码实现这个算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆