快速的方式来计算对出现在Clojure 1.4中的频率 [英] Fast way to count how often pairs appear in Clojure 1.4

查看:105
本文介绍了快速的方式来计算对出现在Clojure 1.4中的频率的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要计算特定的整数对出现在某些过程中的次数(启发式检测使用局部敏感哈希的类似图像 - 整数表示图像,下面的邻居是具有相同哈希值的图像,因此



计数存储为从(有序)对到计数的映射( >匹配

)。



输入( nbrs-list 被认为是邻居的整数列表的列表,其中内部列表(邻居)中的每个不同(有序)对应该被计数。因此,例如,如果 nbrs-list [[1,2,3],[10,8,9]] ,那么对是 [1,2],[1,3],[2,3],[8,9],[8,10],[9,10] code>。



例程 collect 被多次调用; matches 参数累积结果, nbrs-list 在每次调用时都是新的。邻居的最小数目(内部列表的大小)是1,最大〜1000。 nbrs-list 中的每个整数在对 collect 的任何调用中只发生一次(调用,没有配对出现多于一次),整数覆盖范围0 - 1,000,000(因此 nbrs-list 的大小小于1,000,000 - 值只出现一次,有时它们出现在组中 - 但通常大于100,000 - 因为大多数图像没有邻居)。



我已经从更大的块中提取了例程

 (defn-flatten-1 
[list]
(apply concat list))

(defn- pairs
([nbrs]
(let [nbrs(seq nbrs)]
(对(剩余nbrs)(第一个nbrs)(其余nbrs))))
([nbrs a bs]
(lazy-seq
b $ b(如果bs
(let [b(first bs)]
(cons(if(> ab)[[ba] 1] [[ab] 1])(pair nbrs a休息bs))))
(对nbrs))))))

(defn- pairs-map
[nbrs]
(println(count nbrs) )
(let [pairs-list(flatten-1(pairs nbrs))]
(apply hash-map pairs-list)))

(defn- collect
[matches nbrs-list]
(let [to-add(map pairs-map nbrs-list)]
(merge-with + matches(apply(partial merge-with + ))))

所以上面的代码将每一组邻居扩展为有序对;创建一个从 1 ;然后使用添加值组合地图。



我希望运行速度更快。我没有看到如何避免O(n ^ 2)扩展的对,但我想象我可以通过避免这么多中间结构至少减少常量开销。同时,我想让代码相当紧凑和可读...



哦,现在我超过了GC开销限制。所以减少内存使用/ churn也是一个优先级:o)



[也许这太具体了?我希望课程是一般的,并没有看起来很多关于优化真正的clojure代码的帖子。此外,我可以配置文件等,但我的代码看起来很丑陋,我希望有一个明显的,更清洁的方法 - 特别是对扩展。]

解决方案

上述建议似乎有帮助,但是不够。我最终得到了不错的表现:




  • 包装对成值。这是因为 MAX_LONG> 1e12 和 long 实例是可比较的(因此可以很好地作为散列键,不像 long [2] )。与 [n1 n2] 相比,这对降低记忆使用具有显着影响。


  • a href =http://trove4j.sourceforge.net/javadocs/gnu/trove/map/hash/TLongIntHashMap.html =nofollow> TLongByteHashMap 原始类型散列映射并将其改变。

    c>处理带嵌套的配对代码 doseq 循环(或嵌套 for 循环时使用不可变的数据结构)。


  • 改善我的位置敏感哈希。很大一部分的问题是,它太弱了,所以找到太多的邻居 - 如果一百万的图像的邻居有病的约束,那么你得到一百万对,这消耗了一点太多的记忆...




内部循环现在如下:

 (doseq [i(range 1(count nbrs))] 
(doseq [j(range i)]
(let [pair(pack size(nth nbrs i) nbrs j))]
(if-let [value(.get matches pair)]
(.put matches pair(byte(inc value)))
0))))))

其中

 (defn-pack 
[size n1 n2]
(+(* size n1)n2))


[size pair]
[(int(/ pair size))(mod pair size)])


I need to count the number of times that particular pairs of integers occur in some process (a heuristic detecting similar images using locality sensitive hashing - integers represent images and the "neighbours" below are images that have the same hash value, so the count indicates how many different hashes connect the given pair of images).

The counts are stored as a map from (ordered) pair to count (matches below).

The input (nbrs-list below) is a list of a list of integers that are considered neighbours, where every distinct (ordered) pair in the inner list ("the neighbours") should be counted. So, for example, if nbrs-list is [[1,2,3],[10,8,9]] then the pairs are [1,2],[1,3],[2,3],[8,9],[8,10],[9,10].

The routine collect is called multiple times; the matches parameter accumulates results and the nbrs-list is new on each call. The smallest number of neighbours (size of the inner list) is 1 and the largest ~1000. Each integer in nbrs-list occurs just once in any call to collect (this implies that, on each call, no pair occurs more than once) and the integers cover the range 0 - 1,000,000 (so the size of nbrs-list is less than 1,000,000 - since each value occurs just once and sometimes they occur in groups - but typically larger than 100,000 - as most images have no neighbours).

I have pulled the routines out of a larger chunk of code, so they may contain small edit errors, sorry.

(defn- flatten-1
  [list]
  (apply concat list))

(defn- pairs
  ([nbrs]
    (let [nbrs (seq nbrs)]
      (if nbrs (pairs (rest nbrs) (first nbrs) (rest nbrs)))))
  ([nbrs a bs]
    (lazy-seq
      (let [bs (seq bs)]
        (if bs
          (let [b (first bs)]
            (cons (if (> a b) [[b a] 1] [[a b] 1]) (pairs nbrs a (rest bs))))
          (pairs nbrs))))))

(defn- pairs-map
  [nbrs]
  (println (count nbrs))
  (let [pairs-list (flatten-1 (pairs nbrs))]
    (apply hash-map pairs-list)))

(defn- collect
  [matches nbrs-list]
  (let [to-add (map pairs-map nbrs-list)]
    (merge-with + matches (apply (partial merge-with +) to-add))))

So the above code expands each set of neighbours to ordered pairs; creates a map from pairs to 1; then combines maps using addition of values.

I'd like this to run faster. I don't see how to avoid the O(n^2) expansion of pairs, but I imagine I can at least reduce the constant overhead by avoiding so many intermediate structures. At the same time, I'd like the code to be fairly compact and readable...

Oh, and now I am exceeding the "GC overhead limit". So reducing memory use/churn is also a priority :o)

[Maybe this is too specific? I am hoping the lessons are general and haven't seem many posts about optimising "real" clojure code. Also, I can profile etc, but my code seems so ugly I am hoping there's an obvious, cleaner approach - particularly for the pairs expansion.]

解决方案

the above suggestions seemed to help, but were insufficient. i finally got decent performance with:

  • packing pairs into a long value. this works because MAX_LONG > 1e12 and long instances are comparable (so work well as hash keys, unlike long[2]). this had a significant effect on lowering memory use compared to [n1 n2].

  • using a TLongByteHashMap primitive type hash map and mutating it.

  • handling the pair code with nested doseq loops (or nested for loops when using immutable data structures).

  • improving my locality sensitive hash. a big part of the problem was that it was too weak, so finding too many neighbours - if the neighbours of a million images are ill-constrained then you get a million million pairs, which consumes a little too much memory...

the inner loop now looks like:

(doseq [i (range 1 (count nbrs))]
  (doseq [j (range i)]
    (let [pair (pack size (nth nbrs i) (nth nbrs j))]
      (if-let [value (.get matches pair)]
        (.put matches pair (byte (inc value)))
        (.put matches pair (byte 0))))))

where

(defn- pack
  [size n1 n2]
  (+ (* size n1) n2))

(defn- unpack
  [size pair]
  [(int (/ pair size)) (mod pair size)])

这篇关于快速的方式来计算对出现在Clojure 1.4中的频率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆