使用`library(aggregate)`计算两个序列中匹配元素的复杂性 [英] Complexity of counting matching elements in two sequences using `library(aggregate)`

查看:33
本文介绍了使用`library(aggregate)`计算两个序列中匹配元素的复杂性的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们想要计算恰好代表 DNA 序列的两个(可能很长)字符串之间的对应关系.序列是字符列表,其中字符取自 a,c,t,g,'_',其中 '_' 表示不知道".占位符从不对应任何东西,甚至它本身.在这种情况下,我们使用 library(aggregate)(感谢 CapelliC为了这个想法):

We want to count the correspondences between two (possibly long) strings which happen to represent DNA sequences. The sequences are lists-of-chars where the char is taken from a,c,t,g,'_', with the '_' a "don't know" placeholder which never corresponds to anything, even itself. In this case, we employ library(aggregate) (thanks to CapelliC for the idea):

match(Seq1,Seq2,Count) :-
   aggregate_all(count,
      (
         nth1(Pos,Seq1,X),
         nth1(Pos,Seq2,X),
         memberchk(X,[a,c,g,t])
      ),
      N).

这种方法可以比作直截了当"的方法.一种方法是建立一个(尾递归)递归,它只是串联地遍历两个序列并成对地比较元素,并进行计数.

This approach can be compared to a "straightforward" approach where one would set up a (tail-recursive) recursion that just walks down both sequences in tandem and compares elements pairwise, counting as it goes.

由于序列可能非常大,算法的复杂性变得有些有趣.

As the sequences can be very large, algorithmic complexity becomes of some interest.

人们期望,n = length(sequence) 并且两个序列的长度相同:

One would expect, with n = length(sequence) and both sequences the same length:

  • 简单的方法:复杂度是O(n)
  • 聚合方法:复杂度为O(n²)

上述算法的(时间和空间)复杂度是多少?为什么?

What is the (time and maybe space) complexity of the above algorithm and why?

为了补充上述内容,基于 SWI-Prolog plunit 测试代码块:

To complement the above, an SWI-Prolog based plunit test code block:

:- begin_tests(atcg).

wrap_match(String1,String2,Count) :-
   atom_chars(String1,Seq1),
   atom_chars(String2,Seq2),
   fit(Seq1,Seq1,0,Count).

test("string 1 empty",nondet) :-
   wrap_match("atcg","",Count), 
   assertion(Count == 0).
      
test("string 2 empty") :-
   wrap_match("","atcg",Count), 
   assertion(Count == 0).

test("both strings empty") :-
   wrap_match("","",Count), 
   assertion(Count == 0).

test("both strings match, 1 char only") :-
   wrap_match("a","a",Count),   
   assertion(Count == 1).

test("both strings match") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),   
   assertion(MatchCount == 12).

test("both strings match with underscores") :-
   wrap_match("_TC_ATCG_TCG","_TC_ATCG_TCG",Count),   
   assertion(MatchCount == 9).

test("various mismatches 1") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),   
   assertion(MatchCount == 8).

test("various mismatches with underscores") :-
   wrap_match("at_ga_cg__cg","atcgatcgatcg",Count),   
   assertion(Count == 8).
   
:- end_tests(atcg).

所以:

?- run_tests.
% PL-Unit: atcg ........ done
% All 8 tests passed
true.

推荐答案

我认为观察复杂性 O(n²) 不是由于聚合方法本身,但是对于子目标 nth1(Pos,Seq1,X), nth1(Pos,Seq2,X) 表现为嵌套循环"(在序列的n 大小中).

I think that it is interresting to observe that complexity O(n²) is not due to the aggregation approach itself, but to the fact that subgoal nth1(Pos,Seq1,X), nth1(Pos,Seq2,X) behaves as a "nested loop" (in the size n of the sequences).

因此,应该有可能创建另一种算法,即使使用聚合,也可以具有复杂度 O(n),只要嵌套循环"被淘汰了.

Thus, it should be possible to create another algorithm that, even using aggregation, can have complexity O(n), as long as the "nested loop" is eliminated.

比较算法

% Original algorithm: Complexity O(n²)

match1(Seq1, Seq2, Count) :-
   aggregate_all(count,
      (  nth1(Pos, Seq1, X),
         nth1(Pos, Seq2, X),
         memberchk(X, [a,c,g,t]) ),
      Count).

% Proposed algorithm: Complexity O(n)

match2(Seq1, Seq2, Count) :-
   (   maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
   ->  aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
   ;   Count = 0 ).

gimme_random_sequence(Length, Seq) :-
   length(Seq, Length),
   maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).

test(N) :-
   gimme_random_sequence(N, S1),
   gimme_random_sequence(N, S2),
   time(match1(S1, S2, Count)),
   time(match2(S1, S2, Count)).

简单的实证结果

?- test(10000).
% 16,714,057 inferences, 1.156 CPU in 1.156 seconds (100% CPU, 14455401 Lips)
% 39,858 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.

?- test(20000).
% 66,761,535 inferences, 4.594 CPU in 4.593 seconds (100% CPU, 14533123 Lips)
% 79,826 inferences, 0.016 CPU in 0.016 seconds (100% CPU, 5108864 Lips)
true.

?- test(40000).
% 266,856,213 inferences, 19.734 CPU in 19.841 seconds (99% CPU, 13522405 Lips)
% 159,398 inferences, 0.016 CPU in 0.015 seconds (104% CPU, 10201472 Lips)
true.

?- test(80000).
% 1,067,046,835 inferences, 77.203 CPU in 77.493 seconds (100% CPU, 13821291 Lips)
% 320,226 inferences, 0.047 CPU in 0.047 seconds (100% CPU, 6831488 Lips)
true.

编辑 30/04/2021:nth1(I,S,X), nth1(I,S,X) 真的可以用作嵌套循环吗?

Edit 30/04/2021: Does nth1(I,S,X), nth1(I,S,X) really work as nested loop?

要看到这个问题的答案是,请考虑以下 nth/3 的简单实现,它计算找到每个解决方案所需的轮数,使用全局标志:

To see that the answer to this question is yes, consider the following simple implementation of nth/3, that counts the number of rounds needed to find each solution, using a global flag:

nth(Index, List, Item) :-
   (   var(Index)
   ->  nth_nondet(1, Index, List, Item)
   ;   integer(Index)
   ->  nth_det(Index, List, Item)
   ).

nth_det(1, [Item|_], Item) :- !.
nth_det(Index, [_|Rest], Item) :-
   flag(rounds, Rounds, Rounds+1),
   Index1 is Index - 1,
   nth_det(Index1, Rest, Item).

nth_nondet(Index, Index, [Item|_], Item).
nth_nondet(Acc, Index, [_|Rest], Item) :-
   flag(rounds, Rounds, Rounds+1),
   Acc1 is Acc + 1,
   nth_nondet(Acc1, Index, Rest, Item).

要得到轮数,可以问:

?- flag(rounds,_,0), nth(5,[a,b,c,d,e],X), flag(rounds,Rounds,Rounds).
X = e,
Rounds = 4.

现在,使用这个谓词,我们可以创建一个谓词来计算目标nth(I,L,X), nth(I,L,X)的轮数,对于列表不同长度:

Now, using this predicate, we can create a predicate to count the number of rounds of the goal nth(I,L,X), nth(I,L,X), for lists of different lengths:

count_rounds :-
    forall(between(1, 10, N),
           ( Length is 10*N,
             count_rounds(Length, Rounds),
             writeln(rounds(Length) = Rounds)
           )).

count_rounds(Length, _) :-
    numlist(1, Length, List),
    flag(rounds, _, 0),
    nth(Index, List, Item),
    nth(Index, List, Item),
    fail.
count_rounds(_, Rounds) :-
    flag(rounds, Rounds, Rounds).

实证结果:

?- count_rounds.
rounds(10) = 55
rounds(20) = 210
rounds(30) = 465
rounds(40) = 820
rounds(50) = 1275
rounds(60) = 1830
rounds(70) = 2485
rounds(80) = 3240
rounds(90) = 4095
rounds(100) = 5050

如我们所见,目标 nth(I,L,X), nth(I,L,X) 计算 n 阶方阵的一半(包括其对角线).因此,长度为 n 的列表的轮数是 rounds(n) = (n² +n)/2.因此,这个目标的时间复杂度是O(n²).

As we can see, the goal nth(I,L,X), nth(I,L,X) computes half of a square matrix of order n (including its diagonal). Thus, the number of rounds for a list of length n is rounds(n) = (n² + n)/2. Hence, the time complexity of this goal is O(n²).

备注 库谓词nth1/3的实现比谓词nth/3 考虑用于此实验.尽管如此,目标nth1(I,S,X), nth1(I,S,X)的时间复杂度仍然是O(n²).

Remark The implementation of the library predicate nth1/3 is a little more efficient than that of predicate nth/3considered for this experiment. Nevertheless, the time complexity of goal nth1(I,S,X), nth1(I,S,X)still is O(n²).

这篇关于使用`library(aggregate)`计算两个序列中匹配元素的复杂性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆