在Java中稀疏矩阵/阵列 [英] Sparse matrices / arrays in Java

查看:330
本文介绍了在Java中稀疏矩阵/阵列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我工作的一个项目,用Java编写的,它要求我建立一个非常大的2-D稀疏数组。很稀疏,如果有差别。总之:这个应用程序的最重要的方面是在效率的置疑时间上(假设内存的负荷,但几乎没有这么无限的,以允许我使用一个标准的2-D阵列 - 关键的范围是在数十亿美元的两个维度)。

出了kajillion细胞阵列中的,将有几十万个细胞含有一个对象。我需要能够非常迅速地修改单元格的内容。

总之:有谁知道为了这个目的一个特别好的图书馆?它必须是大学伯克利分校,LGPL或类似的许可证(GPL没有,因为产品不能完全开源)。或者,如果有只是一个很简单的方法,使自制稀疏数组对象,我不会有事了。

我在考虑 MTJ ,但还没有听说过它的质量没有任何意见。

解决方案

建有包含HashMap数组疏林是非常低效的频繁读取的数据。最有效的实现中采用了特里,允许访问,此处段分布的单一载体

一个特里可以计算,如果一个元素是present表中的通过执行只读,只有两个数组索引来获得,其中一个元素存储在有效的位置,或者知道它没有从底层存储。

它还可以提供的备用存储阵列疏林的默认值,默认的位置,这样就不需要对返回的索引的测试,因为特里保证所有可能的来源分类指数将映射至少在后备存储的默认位置(在那里你会频繁地存储一个零,或空字符串或一个空对象)。

存在支持快速更新的尝试次数的实现,与otional紧凑型()操作来优化改进的后备存储的大小,在多个操作结束。尝试都远远超过包​​含HashMap快,因为他们不需要任何复杂的哈希函数,也不需要处理冲突的读取(带包含HashMap,你有碰撞无论是阅读和写作,这需要一个循环跳到下一个候选位置,并在他们每个人的测试,以有效的源指数比较...)

此外,爪哇包含HashMap只能索引对象,并为每个哈希来源分类指数创建一个Integer对象(此对象的创建将需要每次读取,而不只是写入)是昂贵的内存操作方面,因为它强调垃圾收集器。

我真的希望JRE包含一个IntegerTrieMap<对象>为缓慢的HashMap&LT的默认实现;整数,对象>或LongTrieMap<对象>作为更慢的HashMap&LT的默认实现;长,对象> ...但这仍然不是这种情况。



您可能想知道什么是特里?

这只是一个小的整数数组(在一个较小的范围比全范围的坐标为您的矩阵的),允许的坐标映射到向量中的整数位置。

例如,假设你想有一个1024 * 1024矩阵只含有少量非零值。而不是存储的矩阵含1024 * 1024元(100多万)一个数组,你可能只是想将其分割成大小为16 * 16的子范围,而你只需要64 * 64等子范围。

在这种情况下,Trie树索引将仅包含64 * 64个整数(4096),并且将有至少16 * 16个数据元素(包含默认零,或在稀疏矩阵中发现的最常见的子范围)。

和用于存储值的向量将只包含1个副本的子范围中等于彼此(由相同的子范围psented其中大多数是全零值,它们将被重新$ P $)。

所以,而是采用了类似的语法矩阵[I] [J] ,你会使用的语法如下:

  trie.values​​ [trie.subrangePositions [(I和〜15)+(J>> 4)] +
            ((ⅰ&安培; 15)&所述; 4;)+(J&安培; 15)]
 

这将使用为线索对象的访问方法可以更方便地处理

下面是一个例子,建成一个评论类(我希望它编译OK,因为它被简化;信号我,如果有错误纠正):

  / **
 *实现一个稀疏矩阵。目前仅限于静态的大小
 *(小于code取代; SIZE_I< / code取代中,< code取代; SIZE_I< / code取代)。
 * /
公共类DoubleTrie {

    / *矩阵的逻辑选择* /
    公共静态最终诠释SIZE_I = 1024;
    公共静态最终诠释SIZE_J = 1024;
    公共静态最后的双DEFAULT_VALUE = 0.0;

    / *内部分裂的选项* /
    私有静态最终诠释SUBRANGEBITS_I = 4;
    私有静态最终诠释SUBRANGEBITS_J = 4;

    / *内部衍生的分裂常数* /
    私有静态最终诠释SUBRANGE_I =
        1<< SUBRANGEBITS_I;
    私有静态最终诠释SUBRANGE_J =
        1<< SUBRANGEBITS_J;
    私有静态最终诠释SUBRANGEMASK_I =
        SUBRANGE_I  -  1;
    私有静态最终诠释SUBRANGEMASK_J =
        SUBRANGE_J  -  1;
    私有静态最终诠释SUBRANGE_POSITIONS =
        SUBRANGE_I * SUBRANGE_J;

    / *内部衍生的默认值的构造函数* /
    私有静态最终诠释SUBRANGES_I =
        (SIZE_I + SUBRANGE_I  -  1)/ SUBRANGE_I;
    私有静态最终诠释SUBRANGES_J =
        (SIZE_J + SUBRANGE_J  -  1)/ SUBRANGE_J;
    私有静态最终诠释子范围=
        SUBRANGES_I * SUBRANGES_J;
    私有静态最终诠释DEFAULT_POSITIONS [] =
        新INT [子范围](0);
    私有静态最后双DEFAULT_VALUES [] =
        新的双[SUBRANGE_POSITIONS](DEFAULT_VALUE);

    / *分裂子范围和偏移的内部快速计算。 * /
    私有静态最终诠释subrangeOf(
            最终诠释我,最终诠释J){
        返回(I>> SUBRANGEBITS_I)* SUBRANGE_J +
               (J>> SUBRANGEBITS_J);
    }
    私有静态最终诠释positionOffsetOf(
            最终诠释我,最终诠释J){
        返回(I和SUBRANGEMASK_I)* MAX_J +
               (J和SUBRANGEMASK_J);
    }

    / **
     *在java.lang.System中的实用程序缺少可比阵列
     *组件类型,其中包括所有原生类型,如双层这里。
     * /
    公共静态最终诠释arraycompare(
            最后的双重[]值1,最终诠释位置1,
            最后的双重[] values​​2,最终诠释位置2,
            最终诠释长度){
        如果(位置1> = 0&安培;&安培;情况2> = 0&安培;&安培;长度> = 0){
            而(length--大于0){
                双值1,值2;
                如果((值1 =值1 [位置1 +长])!=
                    (值2 = values​​2 [形势2 +长])){
                    / *注意:NaN值都来自不同的一切,包括
                     *所有NaN值;他们也比neigher下也不
                     *大于一切,包括NaN的。注意两个
                     *的无穷大值,以及非正规的值,是完全相同
                     *订购并以<,&LT媲美; =,= =,> =,>!=,=。注意
                     *在下面的评论,无限被认为是定义。
                     * /
                    如果(值1<值2)
                        返回-1; / *定义<定义。 * /
                    如果(值1>值2)
                        返回1; / *定义>定义。 * /
                    如果(值1 ==数值2)
                        返回0; / *定义==定义。 * /
                    / *一个或两个为NaN。 * /
                    如果(值1 ==值1)/ *是不是喃? * /
                        返回-1; / *定义<为NaN。 * /
                    如果(值2 ==值2)/ *是不是喃? * /
                        返回1; / *为NaN>定义。 * /
                    / *除此之外,两者都楠:检查自己的precise位
                     *范围0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL
                     *包括规范0x7FF8000000000000L,或在
                     *范围0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL。
                     *所需的只是排序稳定性(NaN是另有
                     *无序)。
                     * /
                    长RAW1,raw2;
                    如果((RAW1 = Double.doubleToRawLongBits(值1))!=
                        (raw2 = Double.doubleToRawLongBits(值2)))
                        返回RAW1< raw2? -1:1;
                    / *否则楠是严格相等,继续。 * /
                }
            }
            返回0;
        }
        抛出新的ArrayIndexOutOfBoundsException异常(
                的位置和长度不能否定);
    }

    / **
     *实用程序的快捷方式在同一阵列比较范围。
     * /
    公共静态最终诠释arraycompare(
            最后的双重[]值,
            最终诠释位置1,最终诠释位置2,
            最终诠释长度){
        返回arraycompare(值,位置1,价值,位置2,长度);
    }

    / **
     *在java.lang.System中的效用失踪的equalizable阵列
     *组件类型,其中包括所有原生类型,如双层这里。
     * /
    公共静态最终布尔arrayequals(
            最后的双重[]值1,最终诠释位置1,
            最后的双重[] values​​2,最终诠释位置2,
            最终诠释长度){
        返回arraycompare(值1,位置1,values​​2,情况2,长度)==
            0;
    }

    / **
     *实用程序的快捷方式,用于识别在同一阵列范围。
     * /
    公共静态最终布尔arrayequals(
            最后的双重[]值,
            最终诠释位置1,最终诠释位置2,
            最终诠释长度){
        返回arrayequals(值,位置1,价值,位置2,长度);
    }

    / **
     *实用程序的快捷方式复制范围在同一阵列。
     * /
    公共静态最终无效arraycopy(
            最后的双重[]值,
            最终诠释srcPosition,最终诠释dstPosition,
            最终诠释长度){
        arraycopy(值,srcPosition,价值观,dstPosition,长度);
    }

    / **
     *实用程序的快捷方式在开始调整大小的数组,preserving值。
     * /
    公共静态最后的双[] arraysetlength(
            双[]值,
            最终诠释满足newLength){
        最终诠释oldLength =
            values​​.length<满足newLength? values​​.length:满足newLength;
        System.arraycopy(值,0,值=新的双[满足newLength],0,
            oldLength);
        返回值;
    }

    / *内部实例成员。 * /
    私人双人值[];
    私人诠释subrangePositions [];
    私人布尔isSharedValues​​;
    私人布尔isSharedSubrangePositions;

    / *内部方法。 * /
    私人最终复位(
            最后的双重[]值,
            最终诠释[] subrangePositions){
        this.isSharedValues​​ =
            (this.values​​ =值)== DEFAULT_VALUES;
        this.isSharedsubrangePositions =
            (this.subrangePositions = subrangePositions)==
                DEFAULT_POSITIONS;
    }

    / **
     *重置矩阵具有相同的初始值填充。
     *
     *参数初值的值设置在所有细胞的位置。
     * /
    公共复位(最终双初值= DEFAULT_VALUE){
        复位(
            (初值== DEFAULT_VALUE)?默认值 :
                新的双[SUBRANGE_POSITIONS](初值),
            DEFAULT_POSITIONS);
    }

    / **
     *默认的构造函数,使用一个默认值。
     *
     *参数初值替代默认值初始化所有
     *矩阵中的位置。
     * /
    公共DoubleTrie(最终双初值= DEFAULT_VALUE){
        this.reset(初值);
    }

    / **
     *这是一个包含了一个有用的preinitialized实例
     * DEFAULT_VALUE在所有细胞。
     * /
    公共静态DoubleTrie DEFAULT_INSTANCE =新DoubleTrie();

    / **
     *拷贝构造函数。请注意,源线索可能是不可变的
     * 或不;但是这个构造函数将创建一个新的可变线索
     *即使新的线索最初全体存储其
     *当源也使用共享存储源。
     * /
    公共DoubleTrie(最终DoubleTrie源){
        this.values​​ =(this.isSharedValues​​ =
            source.isSharedValues​​)?
            source.values​​:
            source.values​​.clone();
        this.subrangePositions =(this.isSharedSubrangePositions =
            source.isSharedSubrangePositions)?
            source.subrangePositions:
            source.subrangePositions.clone());
    }

    / **
     *快速索引的getter。
     *
     * @参数我排的位置,在矩阵设置。
     *位置@参数Ĵ列矩阵中的设置。
     返回:在该位置存储在矩阵中的价值。
     * /
    公共双getAt(最终诠释我,最终诠释J){
        返回值[subrangePositions [subrangeOf(I,J)] +
                      positionOffsetOf(I,J)];
    }

    / **
     *快速索引的二传手。
     *
     * @参数我排的位置,在矩阵疏林设置。
     *位置@参数Ĵ列的矩阵疏林设置。
     *参数值,该值设置在此位置。
     返回:所传递的价值。
     *注:此设置后,不压缩基质疏林。
     * @see紧凑型(空)
     * /
    公共双setAt(最终诠释我,最终诠释我,最终的双重价值){
       最终诠释子范围= subrangeOf(I,J);
       最终诠释positionOffset = positionOffsetOf(I,J);
       //快速检查,看看是否分配将改变一些东西。
       INT subrangePosition,valuePosition;
       如果(Double.compare(
               值[valuePosition =
                   (subrangePosition = subrangePositions [子范围)+
                   positionOffset]
               值)!= 0){
               / *所以,我们需要在价值观进行有效的分配。
                *检查当前的子范围分配共享的不是。
                *注意,我们也包括了DEFAULT_VALUES其可以是
                *其他几个(未测试)特里实例共享,
                *包括由拷贝构造器实例化。 * /
               如果(isSharedValues​​){
                   值= values​​.clone();
                   isSharedValues​​ = FALSE;
               }
               / *扫描所有其他子范围检查中值位置
                *以分配由另一个子范围共享。 * /
               对于(INT otherSubrange = subrangePositions.length;
                       --otherSubrange> = 0; ){
                   如果(otherSubrange!=子区域)
                       继续; / *忽略目标子范围。 * /
                   / *注:范围以下的试验是安全的未来
                    *常见的子范围(待办事项,紧凑的())交织,
                    *虽然,就目前而言,子范围共享位置
                    *只有与他们共同的开始和结束位置,所以我们
                    *可能也只能进行简单的测试< code取代;
                    *(otherSubrangePosition == subrangePosition)< / code>中
                    *测试,而不是位置的两个边界
                    *其他子范围的间隔。 * /
                   INT otherSubrangePosition;
                   如果((otherSubrangePosition =
                           subrangePositions [otherSubrange])> =
                           valuePosition&功放;&安培;
                           otherSubrangePosition + SUBRANGE_POSITIONS&LT​​;
                           valuePosition){
                       / *目标位置由一些其他共享
                        *子范围,我们需要使其成为唯一通过克隆
                        *子范围,以更大的价值载体,复制所有
                        *在新的载体结束当前的子范围值,
                        *指定新值之前。这将需要
                        *改变当前子范围的位置,但
                        *在这之前,我们首先需要检查
                        * subrangePositions阵列本身也分享
                        间实例(包括DEFAULT_POSITIONS *
                        *这应该是preserved,以及可能的数组
                        *由外部工厂构造器的共享
                        *源特里被宣布为不可变的派生
                        * 类)。 * /
                       如果(isSharedSubrangePositions){
                           subrangePositions = subrangePositions.clone();
                           isSharedSubrangePositions = FALSE;
                       }
                       / * TODO:不尝试分配不到一个
                        *完全独立的子范围,使用可能
                        *交错:这将需要扫描所有
                        *其他现有的值找到匹配的
                        *值的改性子范围;但是这可能
                        *可能离开的位置(在当前的子范围
                        *值)未引用任何子范围,后
                        *为当前子区间位置变化。本
                        *扫描可能比登天长为每
                        * assignement,现在来看它的假设紧凑型()
                        *将用于以后,这些assignements后。 * /
                       值= setlengh(
                           值,
                           (subrangePositions [子范围] =
                            subrangePositions = values​​.length)+
                           SUBRANGE_POSITIONS);
                       valuePosition = subrangePositions + positionOffset;
                       打破;
                   }
               }
               / *现在执行的价值的有效分配。 * /
               值[valuePosition] =值;
           }
       }
       返回值;
    }

    / **
     *紧凑共同子范围的存储空间。
     * TODO:这是一个简单的实现没有交错,其
     *将提供更好的数据通信pression。然而,交错其
     * O(N²)的复杂性,其中N是值的总长度,应
     *可在这之后基本COM pression,其复杂度只有尝试
     * O(N²),n为SUBRANGE_POSITIIONS倍N.小
     * /
    公共无效紧凑型(){
        最终诠释oldValues​​Length = values​​.length;
        INT newValues​​Length = 0;
        为(中间体oldPosition = 0;
                 oldPosition< oldValues​​Length;
                 oldPosition + = SUBRANGE_POSITIONS){
            INT oldPosition =位置[子范围]
            布尔commonSubrange = FALSE;
            / *可能的共同的子范围扫描值。 * /
            对于(INT在newPosition = newValues​​Length;
                    (在newPosition  -  = SUBRANGE_POSITIONS)GT; = 0; )
                如果(arrayequals(值,在newPosition,oldPosition,
                        SUBRANGE_POSITIONS)){
                    commonSubrange =真;
                    所有匹配] | / *更新subrangePositions
                     *从oldPosition到在newPosition位置。有可能
                     *有几个指标的变化,如果线索已经有
                     *压实()之前,后来重新分配。 * /
                    对于(子范围= subrangePositions.length;
                         --subrange> = 0; )
                        如果(subrangePositions [子范围] == oldPosition)
                            subrangePositions [子范围] =在newPosition;
                    打破;
                }
            如果(!commonSubrange){
                / *下移非共同的价值观,如果一些previous
                 *子范围已经融为一体pressed时,他们普遍。
                 * /
                如果(commonSubrange&安培;!&安培;!oldPosition = newValues​​Length){
                    arraycopy(值,oldPosition,newValues​​Length,
                        SUBRANGE_POSITIONS);
                    / *先进的COM pressed值preserve这些新的问题。 * /
                    newValues​​Length + = SUBRANGE_POSITIONS;
                }
            }
        }
        / *检查CO​​M pressed值的数量。 * /
        如果(newValues​​Length< oldValues​​Length){
            值= values​​.arraysetlength(newValues​​Length);
            isSharedValues​​ = FALSE;
        }
    }

}
 

注:此code是不完全的,因为它处理一个矩阵的大小,其压实仅限于只检测常见的子范围,而不交错它们

另外,code没有确定它是最好的宽度或高度,以使用用于分离矩阵成子范围(对于x或y坐标),根据该矩阵大小。它只是使用16的相同的静态子范围的大小(这两个坐标),但它可能是2方便其他小功率(但2非权力将放慢 INT的indexOf(INT,INT ) INT对offsetof(INT,INT)内部方法),independantly为坐标,到矩阵的最大宽度或高度。理想情况下,紧凑型()方法应该能够确定最适合的尺寸。

如果这些分裂子范围的大小可以变化,然后会有一个需要添加实例成员对这些子范围的大小,而不是静态 SUBRANGE_POSITIONS ,并且使静态方法 INT subrangeOf(INT I,诠释J) INT positionOffsetOf(INT I,诠释J)成非静态的;并且初始化数组 DEFAULT_POSITIONS DEFAULT_VALUES 将需要删除或重新定义有所不同。

如果你想支持交错,基本上你会通过将现有值两个大致相同的尺寸(均为最低子范围大小的倍数,第一个子集可能有一个比第二多一个子范围内启动),你会扫描较大的一个在所有连续位置查找匹配的交织;那么你会尝试匹配这些值。然后,您将循环recursely分割成两半的子集(最低子范围的大小也多),你会再次扫描,以配合这些子集(这将会乘以2子集数:你如果加倍怀疑该subrangePositions指数的大小是值得相比于现有的值大小的值,看它是否提供了一个有效的COM pression如果没有,你停在那里(:您从交错融为一体$找到最佳的子范围大小直接p $ pssion处理)在这种情况下;子范围的大小将是可变的,在压实过程中

但是,这code显示如何将非零值,并重新分配数据阵列额外的(不为零)子范围,以及如何可以优化(与紧凑型()使用任务已经完成之后, setAt(INT I,诠释J,双值)方法)时有可能在数据内统一的,并且在 subrangePositions 阵列在相同的位置重新索引重复子范围这一数据的存储。

总之,一个线索的所有原则都没有实现的:

  1. 它总是更快(更紧凑的内存,这意味着更好的地方)重新present使用单一的载体,而不是阵列的双索引数​​组(每一个单独分配)的矩阵。的改进是在双getAt(INT,INT)方法可见!

  2. 您节省了大量的空间,但赋值时,它可能需要一段时间来重新分配新的子范围。出于这个原因,所述子范围不应该太小,否则将发生的重新分配过于频繁了设置矩阵

  3. 有可能通过检测常见的子范围自动变换的初始大矩阵成更紧凑的矩阵。然后一个典型的实现将包含诸如紧凑型()上面的方法。但是,如果的get()的访问是非常快的,并设置()是相当快的,紧凑的()有可能,如果有减去一个大非稀疏randomly-时,很多常见的子范围,以COM preSS(如速度很慢填充矩阵本身,或由零相乘:这将是更简单和更快的在这种情况下,由instanciating一个新的和删除旧的)重置特里结构

  4. 通用子范围,使用通用存储中的数据,所以这个共享的数据必须是只读的。如果你必须改变不改变基体的其余部分的单个值,必须首先确保它被引用只有一次在 subrangePositions 指数。否则,你就需要(方便在结束)的载体的任何位置分配一个新的子范围,然后这个新的子界的位置存储到 subrangePositions 指数。



需要注意的是通用柯尔特库,虽然很不错,因为它是稀疏矩阵的计算工作时也没有那么好,因为它使用散列(或行-COM presssed)工艺不落实尝试支持就目前而言,尽管它是一个很好的优化,这既是节省空间的节省时间,特别是对于最常见的getAt()操作。

即使在这里描述的尝试的setAt()操作可以节省大量的时间(的方式在这里实现的,即没有自动压缩设置后,它仍然可以根据需求实现,估计时间,其中压实仍然会节省大量的在时间的价格存储空间):保存时间正比于细胞的子范围的数量,节省空间是成反比的每子范围细胞的数目。一个好的compromize如果然后使用一个子范围的大小,例如每子范围细胞的数量是细胞在二维矩阵的总数的平方根(具有三维矩阵的计算工作时这将是一个立方根)。

散列在柯尔特稀疏矩阵实现中使用的工艺有不便之处,他们增添了不少存储开销,和缓慢的访问时间,由于可能的碰撞。尝试能避免所有的冲突,然后可以保修时间节省直线为O(n)为O(1)时间在最坏的情况下,如果(n)是可能的碰撞(其中,在对于稀疏矩阵,可能是数达非默认值中的细胞基质的数量,也就是直到矩阵倍的因子正比于散列填充率的大小的总数量,对于一个非稀疏即全矩阵)。

在柯尔特使用的RC(行玉米pressed)工艺是从尝试次数越近,但是这是在另一种价格,这里使用的玉米pression工艺中,具有非常慢的存取时间为最频繁的只读get()操作,而且很慢COM pression的setAt()操作。此外,所使用的玉米pression不正交,不像在尝试次数的这个presentation其中正交是preserved。尝试也将是preserve这种正交相关观看操作,如跨步,换位(可以看作是基于整数环模操作的跨步操作),子区域(和subselections在一般情况下,包括与分类视图)。

我只是希望柯尔特将在未来的更新,以实现使用来尝试其它的实现(即TrieSparseMatrix,而不只是HashSparseMatrix和RCSparseMatrix)。这些想法是在这篇文章。

的实施特罗韦(总部设在内部 - > INT地图)也都是基于散列工艺类似小马的HashedSparseMatrix,即它们具有相同的不便。尝试将是快了很多,以消费适度的额外空间(但这个空间可以进行优化,成为甚至比特罗韦和柯尔特,在延期的时间,利用最后的紧凑型()离子操作的结果矩阵/线索)。

注:本特里实现绑定到特定的本机类型(这里是双)。这是自愿的,因为一般的实现用拳击类型有巨大的空间开销(与慢得多的acccess时间)。在这里,它只是采用双,而不是一般的矢量原生的一维数组。但它绝对有可能获得一个通用的实施以及为尝试次数......不幸的是,Java的仍然不允许写真正泛型类与本机类型的所有好处,但通过写多个实现(对于一般的Object类型或每个原生型),并通过式工厂提供所有这些操作。语言应该能够自动实例化本机实现自动建工厂(现在它甚至没有在Java 7中的情况下,这是一些地方的.Net仍保持其优势,真正泛型类型是一样快本地类型)。

I'm working on a project, written in Java, which requires that I build a very large 2-D sparse array. Very sparse, if that makes a difference. Anyway: the most crucial aspect for this application is efficency in terms of time (assume loads of memory, though not nearly so unlimited as to allow me to use a standard 2-D array -- the key range is in the billions in both dimensions).

Out of the kajillion cells in the array, there will be several hundred thousand cells which contain an object. I need to be able to modify cell contents VERY quickly.

Anyway: Does anyone know a particularly good library for this purpose? It would have to be Berkeley, LGPL or similar license (no GPL, as the product can't be entirely open-sourced). Or if there's just a very simple way to make a homebrew sparse array object, that'd be fine too.

I'm considering MTJ, but haven't heard any opinions on its quality.

解决方案

Sparsed arrays built with hashmaps are very inefficient for frequently read data. The most efficient implementations uses a Trie that allows access to a single vector where segments are distributed.

A Trie can compute if an element is present in the table by performing only read-only TWO array indexing to get the effective position where an element is stored, or to know if its absent from the underlying store.

It can also provide a default position in the backing store for the default value of the sparsed array, so that you don't need ANY test on the returned index, because the Trie guarantees that all possible source index will map at least to the default position in the backing store (where you'll frequently store a zero, or an empty string or a null object).

There exists implementations that support fast-updatable Tries, with an otional "compact()" operation to optimze the size of the backing store at end of multiple operations. Tries are MUCH faster than hashmaps, because they don't need any complex hashing function, and don't need to handle collisions for reads (With Hashmaps, you have collision BOTH for reading and for writing, this requires a loop to skip to the next candidate position, and a test on each of them to compare the effective source index...)

In addition, Java Hashmaps can only index on Objects, and creating an Integer object for each hashed source index (this object creation will be needed for every read, not just writes) is costly in terms of memory operations, as it stresses the garbage collector.

I really hoped that the JRE included an IntegerTrieMap<Object> as the default implementation for the slow HashMap<Integer, Object> or LongTrieMap<Object> as the default implementation for the even slower HashMap<Long, Object>... But this is still not the case.



You may wonder what is a Trie?

It's just a small array of integers (in a smaller range than the full range of coordinates for your matrix) that allows mapping the coordinates into an integer position in a vector.

For example suppose you want a 1024*1024 matrix containing only a few non zero values. Instead of storing that matrix in a array containing 1024*1024 elements (more than 1 million), you may just want to split it into subranges of size 16*16, and you'll just need 64*64 such subranges.

In that case, the Trie index will contain just 64*64 integers (4096), and there will be at least 16*16 data elements (containing the default zeroes, or the most common subrange found in your sparse matrix).

And the vector used to store the values will contain only 1 copy for subranges that are equal with each other (most of them being full of zeroes, they will be represented by the same subrange).

So instead of using a syntax like matrix[i][j], you'd use a syntax like:

trie.values[trie.subrangePositions[(i & ~15) + (j >> 4)] +
            ((i & 15) << 4) + (j & 15)]

which will be more conveniently handled using an access method for the trie object.

Here is an example, built into a commented class (I hope it compiles OK, as it was simplified; signal me if there are errors to correct):

/**
 * Implement a sparse matrix. Currently limited to a static size
 * (<code>SIZE_I</code>, <code>SIZE_I</code>).
 */
public class DoubleTrie {

    /* Matrix logical options */        
    public static final int SIZE_I = 1024;
    public static final int SIZE_J = 1024;
    public static final double DEFAULT_VALUE = 0.0;

    /* Internal splitting options */
    private static final int SUBRANGEBITS_I = 4;
    private static final int SUBRANGEBITS_J = 4;

    /* Internal derived splitting constants */
    private static final int SUBRANGE_I =
        1 << SUBRANGEBITS_I;
    private static final int SUBRANGE_J =
        1 << SUBRANGEBITS_J;
    private static final int SUBRANGEMASK_I =
        SUBRANGE_I - 1;
    private static final int SUBRANGEMASK_J =
        SUBRANGE_J - 1;
    private static final int SUBRANGE_POSITIONS =
        SUBRANGE_I * SUBRANGE_J;

    /* Internal derived default values for constructors */
    private static final int SUBRANGES_I =
        (SIZE_I + SUBRANGE_I - 1) / SUBRANGE_I;
    private static final int SUBRANGES_J =
        (SIZE_J + SUBRANGE_J - 1) / SUBRANGE_J;
    private static final int SUBRANGES =
        SUBRANGES_I * SUBRANGES_J;
    private static final int DEFAULT_POSITIONS[] =
        new int[SUBRANGES](0);
    private static final double DEFAULT_VALUES[] =
        new double[SUBRANGE_POSITIONS](DEFAULT_VALUE);

    /* Internal fast computations of the splitting subrange and offset. */
    private static final int subrangeOf(
            final int i, final int j) {
        return (i >> SUBRANGEBITS_I) * SUBRANGE_J +
               (j >> SUBRANGEBITS_J);
    }
    private static final int positionOffsetOf(
            final int i, final int j) {
        return (i & SUBRANGEMASK_I) * MAX_J +
               (j & SUBRANGEMASK_J);
    }

    /**
     * Utility missing in java.lang.System for arrays of comparable
     * component types, including all native types like double here.
     */
    public static final int arraycompare(
            final double[] values1, final int position1,
            final double[] values2, final int position2,
            final int length) {
        if (position1 >= 0 && position2 >= 0 && length >= 0) {
            while (length-- > 0) {
                double value1, value2;
                if ((value1 = values1[position1 + length]) !=
                    (value2 = values2[position2 + length])) {
                    /* Note: NaN values are different from everything including
                     * all Nan values; they are are also neigher lower than nor
                     * greater than everything including NaN. Note that the two
                     * infinite values, as well as denormal values, are exactly
                     * ordered and comparable with <, <=, ==, >=, >=, !=. Note
                     * that in comments below, infinite is considered "defined".
                     */
                    if (value1 < value2)
                        return -1;        /* defined < defined. */
                    if (value1 > value2)
                        return 1;         /* defined > defined. */
                    if (value1 == value2)
                        return 0;         /* defined == defined. */
                    /* One or both are NaN. */
                    if (value1 == value1) /* Is not a NaN? */
                        return -1;        /* defined < NaN. */
                    if (value2 == value2) /* Is not a NaN? */
                        return 1;         /* NaN > defined. */
                    /* Otherwise, both are NaN: check their precise bits in
                     * range 0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL
                     * including the canonical 0x7FF8000000000000L, or in
                     * range 0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL.
                     * Needed for sort stability only (NaNs are otherwise
                     * unordered).
                     */
                    long raw1, raw2;
                    if ((raw1 = Double.doubleToRawLongBits(value1)) !=
                        (raw2 = Double.doubleToRawLongBits(value2)))
                        return raw1 < raw2 ? -1 : 1;
                    /* Otherwise the NaN are strictly equal, continue. */
                }
            }
            return 0;
        }
        throw new ArrayIndexOutOfBoundsException(
                "The positions and length can't be negative");
    }

    /**
     * Utility shortcut for comparing ranges in the same array.
     */
    public static final int arraycompare(
            final double[] values,
            final int position1, final int position2,
            final int length) {
        return arraycompare(values, position1, values, position2, length);
    }

    /**
     * Utility missing in java.lang.System for arrays of equalizable
     * component types, including all native types like double here.
     */ 
    public static final boolean arrayequals(
            final double[] values1, final int position1,
            final double[] values2, final int position2,
            final int length) {
        return arraycompare(values1, position1, values2, position2, length) ==
            0;
    }

    /**
     * Utility shortcut for identifying ranges in the same array.
     */
    public static final boolean arrayequals(
            final double[] values,
            final int position1, final int position2,
            final int length) {
        return arrayequals(values, position1, values, position2, length);
    }

    /**
     * Utility shortcut for copying ranges in the same array.
     */
    public static final void arraycopy(
            final double[] values,
            final int srcPosition, final int dstPosition,
            final int length) {
        arraycopy(values, srcPosition, values, dstPosition, length);
    }

    /**
     * Utility shortcut for resizing an array, preserving values at start.
     */
    public static final double[] arraysetlength(
            double[] values,
            final int newLength) {
        final int oldLength =
            values.length < newLength ? values.length : newLength;
        System.arraycopy(values, 0, values = new double[newLength], 0,
            oldLength);
        return values;
    }

    /* Internal instance members. */
    private double values[];
    private int subrangePositions[];
    private bool isSharedValues;
    private bool isSharedSubrangePositions;

    /* Internal method. */
    private final reset(
            final double[] values,
            final int[] subrangePositions) {
        this.isSharedValues =
            (this.values = values) == DEFAULT_VALUES;
        this.isSharedsubrangePositions =
            (this.subrangePositions = subrangePositions) ==
                DEFAULT_POSITIONS;
    }

    /**
     * Reset the matrix to fill it with the same initial value.
     *
     * @param initialValue  The value to set in all cell positions.
     */
    public reset(final double initialValue = DEFAULT_VALUE) {
        reset(
            (initialValue == DEFAULT_VALUE) ? DEFAULT_VALUES :
                new double[SUBRANGE_POSITIONS](initialValue),
            DEFAULT_POSITIONS);
    }

    /**
     * Default constructor, using single default value.
     *
     * @param initialValue  Alternate default value to initialize all
     *                      positions in the matrix.
     */
    public DoubleTrie(final double initialValue = DEFAULT_VALUE) {
        this.reset(initialValue);
    }

    /**
     * This is a useful preinitialized instance containing the
     * DEFAULT_VALUE in all cells.
     */
    public static DoubleTrie DEFAULT_INSTANCE = new DoubleTrie();

    /**
     * Copy constructor. Note that the source trie may be immutable
     * or not; but this constructor will create a new mutable trie
     * even if the new trie initially shares some storage with its
     * source when that source also uses shared storage.
     */
    public DoubleTrie(final DoubleTrie source) {
        this.values = (this.isSharedValues =
            source.isSharedValues) ?
            source.values :
            source.values.clone();
        this.subrangePositions = (this.isSharedSubrangePositions =
            source.isSharedSubrangePositions) ?
            source.subrangePositions :
            source.subrangePositions.clone());
    }

    /**
     * Fast indexed getter.
     *
     * @param i  Row of position to set in the matrix.
     * @param j  Column of position to set in the matrix.
     * @return   The value stored in matrix at that position.
     */
    public double getAt(final int i, final int j) {
        return values[subrangePositions[subrangeOf(i, j)] +
                      positionOffsetOf(i, j)];
    }

    /**
     * Fast indexed setter.
     *
     * @param i      Row of position to set in the sparsed matrix.
     * @param j      Column of position to set in the sparsed matrix.
     * @param value  The value to set at this position.
     * @return       The passed value.
     * Note: this does not compact the sparsed matric after setting.
     * @see compact(void)
     */
    public double setAt(final int i, final int i, final double value) {
       final int subrange       = subrangeOf(i, j);
       final int positionOffset = positionOffsetOf(i, j);
       // Fast check to see if the assignment will change something.
       int subrangePosition, valuePosition;
       if (Double.compare(
               values[valuePosition =
                   (subrangePosition = subrangePositions[subrange]) +
                   positionOffset],
               value) != 0) {
               /* So we'll need to perform an effective assignment in values.
                * Check if the current subrange to assign is shared of not.
                * Note that we also include the DEFAULT_VALUES which may be
                * shared by several other (not tested) trie instances,
                * including those instanciated by the copy contructor. */
               if (isSharedValues) {
                   values = values.clone();
                   isSharedValues = false;
               }
               /* Scan all other subranges to check if the position in values
                * to assign is shared by another subrange. */
               for (int otherSubrange = subrangePositions.length;
                       --otherSubrange >= 0; ) {
                   if (otherSubrange != subrange)
                       continue; /* Ignore the target subrange. */
                   /* Note: the following test of range is safe with future
                    * interleaving of common subranges (TODO in compact()),
                    * even though, for now, subranges are sharing positions
                    * only between their common start and end position, so we
                    * could as well only perform the simpler test <code>
                    * (otherSubrangePosition == subrangePosition)</code>,
                    * instead of testing the two bounds of the positions
                    * interval of the other subrange. */
                   int otherSubrangePosition;
                   if ((otherSubrangePosition =
                           subrangePositions[otherSubrange]) >=
                           valuePosition &&
                           otherSubrangePosition + SUBRANGE_POSITIONS <
                           valuePosition) {
                       /* The target position is shared by some other
                        * subrange, we need to make it unique by cloning the
                        * subrange to a larger values vector, copying all the
                        * current subrange values at end of the new vector,
                        * before assigning the new value. This will require
                        * changing the position of the current subrange, but
                        * before doing that, we first need to check if the
                        * subrangePositions array itself is also shared
                        * between instances (including the DEFAULT_POSITIONS
                        * that should be preserved, and possible arrays
                        * shared by an external factory contructor whose
                        * source trie was declared immutable in a derived
                        * class). */
                       if (isSharedSubrangePositions) {
                           subrangePositions = subrangePositions.clone();
                           isSharedSubrangePositions = false;
                       }
                       /* TODO: no attempt is made to allocate less than a
                        * fully independant subrange, using possible
                        * interleaving: this would require scanning all
                        * other existing values to find a match for the
                        * modified subrange of values; but this could
                        * potentially leave positions (in the current subrange
                        * of values) unreferenced by any subrange, after the
                        * change of position for the current subrange. This
                        * scanning could be prohibitively long for each
                        * assignement, and for now it's assumed that compact()
                        * will be used later, after those assignements. */
                       values = setlengh(
                           values,
                           (subrangePositions[subrange] =
                            subrangePositions = values.length) +
                           SUBRANGE_POSITIONS);
                       valuePosition = subrangePositions + positionOffset;
                       break;
                   }
               }
               /* Now perform the effective assignment of the value. */
               values[valuePosition] = value;
           }
       }
       return value;
    }

    /**
     * Compact the storage of common subranges.
     * TODO: This is a simple implementation without interleaving, which
     * would offer a better data compression. However, interleaving with its
     * O(N²) complexity where N is the total length of values, should
     * be attempted only after this basic compression whose complexity is
     * O(n²) with n being SUBRANGE_POSITIIONS times smaller than N.
     */
    public void compact() {
        final int oldValuesLength = values.length;
        int newValuesLength = 0;
        for (int oldPosition = 0;
                 oldPosition < oldValuesLength;
                 oldPosition += SUBRANGE_POSITIONS) {
            int oldPosition = positions[subrange];
            bool commonSubrange = false;
            /* Scan values for possible common subranges. */
            for (int newPosition = newValuesLength;
                    (newPosition -= SUBRANGE_POSITIONS) >= 0; )
                if (arrayequals(values, newPosition, oldPosition,
                        SUBRANGE_POSITIONS)) {
                    commonSubrange = true;
                    /* Update the subrangePositions|] with all matching
                     * positions from oldPosition to newPosition. There may
                     * be several index to change, if the trie has already
                     * been compacted() before, and later reassigned. */
                    for (subrange = subrangePositions.length;
                         --subrange >= 0; )
                        if (subrangePositions[subrange] == oldPosition)
                            subrangePositions[subrange] = newPosition;
                    break;
                }
            if (!commonSubrange) {
                /* Move down the non-common values, if some previous
                 * subranges have been compressed when they were common.
                 */
                if (!commonSubrange && oldPosition != newValuesLength) {
                    arraycopy(values, oldPosition, newValuesLength,
                        SUBRANGE_POSITIONS);
                    /* Advance compressed values to preserve these new ones. */
                    newValuesLength += SUBRANGE_POSITIONS;
                }
            }
        }
        /* Check the number of compressed values. */
        if (newValuesLength < oldValuesLength) {
            values = values.arraysetlength(newValuesLength);
            isSharedValues = false;
        }
    }

}

Note: this code is not complete because it handles a single matrix size, and its compactor is limited to detect only common subranges, without interleaving them.

Also, the code does not determine where it is the best width or height to use for splitting the matrix into subranges (for x or y coordinates), according to the matrix size. It just uses the same static subrange sizes of 16 (for both coordinates), but it could be conveniently any other small power of 2 (but a non power of 2 would slow down the int indexOf(int, int) and int offsetOf(int, int) internal methods), independantly for both coordinates, and up to the maximum width or height of the matrix.ideally the compact() method should be able to determine the best fitting sizes.

If these splitting subranges sizes can vary, then there will be a need to add instance members for these subrange sizes instead of the static SUBRANGE_POSITIONS, and to make the static methods int subrangeOf(int i, int j) and int positionOffsetOf(int i, int j) into non static; and the initialization arrays DEFAULT_POSITIONSand DEFAULT_VALUES will need to be dropped or redefined differently.

If you want to support interleaving, basically you'll start by dividing the existing values in two of about the same size (both being a multiple of the minimum subrange size, the first subset possibly having one more subrange than the second one), and you'll scan the larger one at all successive positions to find a matching interleaving; then you'll try to match these values. Then you'll loop recursely by dividing the subsets in halves (also a multiple of the minimum subrange size) and you'll scan again to match these subsets (this will multiply the number of subsets by 2: you have to wonder if the doubled size of the subrangePositions index is worth the value compared to the existing size of the values to see if it offers an effective compression (if not, you stop there: you have found the optimum subrange size directly from the interleaving compression process). In that case; the subrange size will be mutable, during compaction.

But this code shows how you assign non-zero values and reallocate the data array for additional (non-zero) subranges, and then how you can optimize (with compact() after assignments have been performed using the setAt(int i, int j, double value) method) the storage of this data when there are duplicate subranges that may be unified within the data, and reindexed at the same position in the subrangePositions array.

Anyway, all the principles of a trie are implemented there:

  1. It is always faster (and more compact in memory, meaning better locality) to represent a matrix using a single vector instead of a double-indexed array of arrays (each one allocated separately). The improvement is visible in the double getAt(int, int) method !

  2. You save a lot of space, but when assigning values it may take time to reallocate new subranges. For this reason, the subranges should not be too small or reallocations will occur too frequently for setting up your matrix.

  3. It is possible to transform an initial large matrix automatically into a more compact matrix by detecting common subranges. A typical implementation will then contain a method such as compact() above. However, if get() access is very fast and set() is quite fast, compact() may be very slow if there are lots of common subranges to compress (for example when substracting a large non-sparse randomly-filled matrix with itself, or multiplying it by zero: it will be simpler and much faster in that case to reset the trie by instanciating a new one and dropping the old one).

  4. Common subranges use common storage in the data, so this shared data must be read-only. If you must change a single value without changing the rest of the matrix, you must first make sure that it is referenced only one time in the subrangePositions index. Otherwise you'll need to allocate a new subrange anywhere (conveniently at end) of the values vector, and then store the position of this new subrange into the subrangePositions index.



Note that the generic Colt library, though very good as it is, is not as good when working on sparse matrice, because it uses hashing (or row-compresssed) technics which do not implement the support for tries for now, despite it is an excellent optimization, which is both space-saving and time-saving, notably for the most frequent getAt() operations.

Even the setAt() operation described here for tries saves lot of time (the way is is implemented here, i.e. without automatic compaction after setting, which could still be implemented based on demand and estimated time where compaction would still save lot of storage space at the price of time): the time saving is proportional to the number of cells in subranges, and space saving is inversely proportional to the number of cells per subrange. A good compromize if then to use a subrange size such the number of cells per subrange is the square root of the total number of cells in a 2D matrix (it would be a cubic root when working with 3D matrice).

Hashing technics used in Colt sparse matrix implementations have the inconvenience that they add a lot of storage overhead, and slow access time due to possible collisions. Tries can avoid all collisions, and can then warranty to save linear O(n) time to O(1) time in the worst cases, where (n) is the number of possible collisions (which, in case of sparse matrix, may be up to the number of non-default-value cells in the matrix, i.e. up to the total number of size of the matrix times a factor proportional to the hashing filling factor, for a non-sparse i.e. full matrix).

The RC (row-compressed) technics used in Colt are nearer from Tries, but this is at another price, here the compression technics used, that have very slow access time for the most frequent read-only get() operations, and very slow compression for setAt() operations. In addition, the compression used is not orthogonal, unlike in this presentation of Tries where orthogonality is preserved. Tries would also be preserve this orthogonality for related viewing operations such as striding, transposition (viewed as a striding operation based on integer cyclic modular operations), subranging (and subselections in general, including with sorting views).

I just hope that Colt will be updated in some future to implement another implementation using Tries (i.e. TrieSparseMatrix instead of just HashSparseMatrix and RCSparseMatrix). The ideas are in this article.

The Trove implementation (based in int->int maps) are also based on hashing technics similar to the Colt's HashedSparseMatrix, i.e. they have the same inconvenience. Tries will be a lot faster, with a moderate additional space consumed (but this space can be optimized and become even better than Trove and Colt, in a deferred time, using a final compact()ion operation on the resulting matrix/trie).

Note: this Trie implementation is bound to a specific native type (here double). This is voluntary, because generic implementation using boxing types have a huge space overhead (and are much slower in acccess time). Here it just uses native unidimensional arrays of double rather than generic Vector. But it is certainly possible to derive a generic implementation as well for Tries... Unfortunately, Java still does not allow writing really generic classes with all the benefits of native types, except by writing multiple implementations (for a generic Object type or for each native type), and serving all these operation via a type factory. The language should be able to automatically instanciate the native implementations and build the factory automatically (for now it is not the case even in Java 7, and this is something where .Net still maintains its advantage for really generic types that are as fast as native types).

这篇关于在Java中稀疏矩阵/阵列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆