Curious about the HashTable performance issues


Question

I read that hash tables in Haskell had performance issues (on the Haskell-Cafe in 2006 and Flying Frog Consultancy's blog in 2009), and since I like Haskell it worried me.

That was a year ago, what is the status now (June 2010)? Has the "hash table problem" been fixed in GHC?

Solution

The problem was that the garbage collector is required to traverse mutable arrays of pointers ("boxed arrays") looking for pointers to data that might be ready to deallocate. Boxed, mutable arrays are the main mechanism for implementing a hashtable, so that particular structure showed up the GC traversal issue. This is common to many languages. The symptom is excessive garbage collection (up to 95% of time spent in GC).

The fix was to implement "card marking" in the GC for mutable arrays of pointers, which occurred in late 2009. You shouldn't see excessive GC when using mutable arrays of pointers in Haskell now. In simple benchmarks, hashtable insertion for large hashes improved by 10x.

Note that the GC walking issue doesn't affect purely functional structures, nor unboxed arrays (like most data parallel arrays, or vector-like arrays) in Haskell. Nor does it affect hashtables stored on the C heap (like judy). Meaning that it didn't affect day-to-day Haskellers not using imperative hash tables.
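The boxed/unboxed distinction can be seen directly with the `array` package that ships with GHC: a boxed `IOArray` holds pointers the GC must scan, while an unboxed `IOUArray` holds raw machine values it can skip entirely. A minimal sketch (the array size and values are arbitrary):

```haskell
import Data.Array.IO

main :: IO ()
main = do
  -- A boxed mutable array: each element is a pointer to a heap object,
  -- so the GC must scan the whole array for live data. This is the
  -- structure that triggered the pre-2009 slowdown.
  boxed <- newArray (1, 1000000) 0 :: IO (IOArray Int Int)
  writeArray boxed 100 42
  -- An unboxed mutable array: elements are raw Ints with no pointers,
  -- so the GC never needs to walk it and the issue cannot arise.
  unboxed <- newArray (1, 1000000) 0 :: IO (IOUArray Int Int)
  writeArray unboxed 100 42
  bv <- readArray boxed 100
  uv <- readArray unboxed 100
  print (bv, uv)
```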

If you are using hashtables in Haskell, you shouldn't observe any issue now. Here, for example, is a simple hashtable program that inserts 10 million ints into a hash. I'll do the benchmarking, since the original citation doesn't present any code or benchmarks.

import Control.Monad
import qualified Data.HashTable as H
import System.Environment

main = do
  [size] <- fmap (fmap read) getArgs
  m <- H.new (==) H.hashInt
  forM_ [1..size] $ \n -> H.insert m n n
  v <- H.lookup m 100
  print v

With GHC 6.10.2, before the fix, inserting 10M ints:

$ time ./A 10000000 +RTS -s
...
47s

With GHC 6.13, after the fix:

$ time ./A 10000000 +RTS -s
...
8s

Increasing the default heap area:

$ time ./A +RTS -s -A2G
...
2.3s

Avoiding hashtables and using an IntMap:

import Control.Monad
import Data.List
import qualified Data.IntMap as I
import System.Environment

main = do
  [size] <- fmap (fmap read) getArgs
  let k = foldl' (\m n -> I.insert n n m) I.empty [1..size]
  print $ I.lookup 100 k

And we get:

$ time ./A 10000000 +RTS -s
...
6s

Or, alternatively, using a judy array (which is a Haskell wrapper calling C code through the foreign-function interface):

import Control.Monad
import Data.List
import System.Environment
import qualified Data.Judy as J

main = do
  [size] <- fmap (fmap read) getArgs
  j <- J.new :: IO (J.JudyL Int)
  forM_ [1..size] $ \n -> J.insert (fromIntegral n) n j
  print =<< J.lookup 100 j

Running this:

$ time ./A 10000000 +RTS -s
...
2.1s

So, as you can see, the GC issue with hashtables is fixed, and there have always been other libraries and data structures which were perfectly suitable. In summary, this is a non-issue.

Note: as of 2013, you should probably just use the hashtables package, which supports a range of mutable hashtables natively.
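A minimal sketch of that package's API, via its `Data.HashTable.IO` module and the `BasicHashTable` implementation (this assumes the `hashtables` package is installed; the insert count is arbitrary):

```haskell
import qualified Data.HashTable.IO as H

main :: IO ()
main = do
  -- Create a mutable hash table living in IO; the type annotation
  -- picks the basic (linear-probing) implementation.
  ht <- H.new :: IO (H.BasicHashTable Int Int)
  -- Insert keys 1..100000, each mapped to itself.
  mapM_ (\n -> H.insert ht n n) [1 .. 100000 :: Int]
  -- Lookup returns Maybe, since the key may be absent.
  v <- H.lookup ht 100
  print v
```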
