就地基数排序 [英] In-Place Radix Sort

查看:103
本文介绍了就地基数排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个长的文本。请多多包涵。归结,问题是:?是否有可行就地基数排序算法


preliminary

我有一个巨大的的小的固定长度的字符串只能使用字母A号,C,G和T(是的,你已经猜到它: DNA ),我想排序

目前,我用的std ::排序,它使用 introsort STL 的所有常见的实现。这工作得很好。但是,我相信,基数排序千篇一律设置我的问题很好,也应该更好在实践中。

详细信息

我有一个很天真的实现和相对较小的输入(上万数量级),这是真实的(嗯,至少两倍以上的速度)测试了这个假设。然而,运行时降低一败涂地时,问题的规模变大( N 的> 5000000)。

原因是显而易见的:基数排序需要复制整个数据(不止一次在我幼稚的做法,实际上)。这意味着,我已经把〜4吉布到我的主内存,这显然是杀害性能。即使没有,我不能使用这么多的内存,因为问题规模实际上成了更大。

的用例

理想情况下,该算法应与2和100之间的任何字符串长度工作,用于DNA以及DNA5(其允许额外的通配符N),或甚至用的DNA的 IUPAC 歧义codeS (产生16个不同的值)。但是,我知道,所有这些情况下,不能被覆盖,所以我很高兴与任何速度的提高,我得到。在code可以动态地决定派遣的算法。

研究

不幸的是,在基数的维基百科的文章排序是没用的。关于就地变体的部分是完整的垃圾。对基数的 NIST-DADS段排序是不存在的下一个。有一个有前途的冠冕堂皇的纸被称为有效的自适应就地基数排序描述算法MSL 。不幸的是,本文中,也同样是令人失望的。

在特定的,有下面的东西。

首先,算法包含几个错误和留下了很多不明原因。特别是,它并没有详细的递归调用(我只是假设,它增加或减少一些指针,计算当前移位和屏蔽)。同时,它采用了功能 dest_group dest_address 没有给予定义。我看不出如何实施这些有效的(即在O(1);至少 dest_address 是不平凡的)。

最后但并非最不重要的,该算法实现就地岬通过交换数组索引与输入数组中的元素。这显然​​只适用于数字阵列。我需要使用它的字符串。当然,我可以只拧强类型并继续假设内存会容忍我存储在它不属于索引。但这只只要我可以挤我的字符串转换成32位的内存(假设32位整数)的作品。这只是16个字符(让我们忽略了的那一刻起16>日志(5000000))。

另一篇论文的作者之一,没有给出准确的描述可言,但它给MSL的运行时间为亚线性这是平出错误的。

回顾一下:有没有找到工作的参考实现的希望或工作就地基数排序的DNA串的作品至少有一个良好的伪code /说明<? / P>

解决方案

好了,这里有一个简单的实现了MSD基数排序的DNA中。这是写在D中,因为这是我用得最多,所以我最不可能做的愚蠢的错误的语言,但它可以很容易地被翻译成其他语言。这是就地,但需要2 * seq.length通过数组。

 无效radixSort(字符串[] seqs,为size_t基地= 0){
    如果(seqs.length == 0)
        返回;

    为size_t TPOS = seqs.length,准= 0;
    为size_t I = 0;
    而(I&LT; TPOS){
        如果(seqs [I] [基] =='A'){
             交换(seqs [I],seqs [APOS ++]);
             我++;
        }
        否则,如果(seqs [I] [基] =='T'){
            交换(seqs [I],seqs [ -  TPOS]);
        }否则我++;
    }

    I =者;
    为size_t CPOS =者;
    而(I&LT; TPOS){
        如果(seqs [I] [基] =='C'){
            交换(seqs [I],seqs [CPOS ++]);
        }
        我++;
    }
    如果(基地&LT; seqs [0] .length  -  1){
        radixSort(seqs [0..APos],基地+ 1);
        radixSort(seqs [APos..CPos],基地+ 1);
        radixSort(seqs [CPos..TPos],基地+ 1);
        radixSort(seqs [TPos..seqs.length],基地+ 1);
   }
}
 

显然,这是一种特异的DNA,而不是作为一般的,但它应是快速

编辑:我好奇这个code是否实际工作,所以我测试/调试它,而等待自己的生物信息学code运行。上面现在的版本实际上是测试和工程。对于每5个碱基千万序列,它的3倍左右比优化introsort更快。

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?


Preliminary

I've got a huge number of small fixed-length strings that only use the letters "A", "C", "G" and "T" (yes, you've guessed it: DNA) that I want to sort.

At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.

Details

I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).

The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.

Use Cases

Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character "N"), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.

Research

Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm "MSL". Unfortunately, this paper, too, is disappointing.

In particular, there are the following things.

First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).

Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).

Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.

To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

解决方案

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.

void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
             swap(seqs[i], seqs[APos++]);
             i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
   }
}

Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.

Edit: I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

这篇关于就地基数排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆