编辑Haskell中的距离算法 - 性能调整 [英] edit distance algorithm in Haskell - performance tuning

查看:110
本文介绍了编辑Haskell中的距离算法 - 性能调整的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在Haskell中实现levenshtein距离(或编辑距离),但是当字符串长度增加时,它的性能会迅速下降。

我仍然Haskell相当新颖,所以如果你能给我一些关于如何改进算法的建议,那就太好了。我已经试着预先计算值(inits),但是因为它没有改变任何东西,所以我恢复了这个改变。



我知道已经有一个编辑距离的实施,但我需要它的工作任意标记的列表,不一定是字符串。

所以,以下是代码:




pre>
- 两个列表之间的标准levenshtein距离
editDistance :: Eq a => [a] - > [a] - > Int
editDistance s1 s2 = editDistance'1 1 1 s1 s2

- 加权levenshtein距离
- ins,sub和del是各种操作的费用
editDistance':: Eq a => Int - > Int - > Int - > [a] - > [a] - > Int
editDistance'__sins s1 [] = ins * length s1
editDistance'__ins [] s2 = ins * length s2
editDistance'del sub ins s1 s2
| last s1 == last s2 = editDistance'del sub ins(init s1)(init s2)
|其他=最小[editDistance'del sub s1(init s2)+ del - 删除
,editDistance'del sub ins(init s1)(init s2)+ sub - substitution
,editDistance'del子插件(init s1)s2 +插件
]

这似乎是一个正确的实现,至少它给出与此在线工具完全相同的结果。



预先感谢您的帮助!如果您需要任何其他信息,请告诉我们。



问候,
bzn

解决方案

忽略这是一个不好的算法(应该是记忆,我会到那一秒)...

使用O (1)原语而不是O(n)

其中一个问题是你使用了一大堆列表中的O(n)调用(haskell列表是单独链接列表)。一个更好的数据结构会给你O(1)操作,我使用了向量

 将限定的Data.Vector导入为V 

- 两个列表之间的标准levenshtein距离
editDistance :: Eq a => [a] - > [a] - > Int
editDistance s1 s2 = editDistance'1 1 1(V.fromList s1)(V.fromList s2)

- 加权levenshtein距离
- ins,sub和del是各种操作的成本
editDistance':: Eq a => Int - > Int - > Int - > V.Vector a - > V.Vector a - > Int
editDistance'del sub ins s1 s2
| V.null s2 = ins * V.length s1
| V.null s1 = ins * V.length s2
| V.last s1 == V.last s2 = editDistance'del sub ins(V.init s1)(V.init s2)
|否则=最小[editDistance'del sub ins s1(V.init s2)+ del - 删除
,editDistance'del sub ins(V.init s1)(V.init s2)+ sub - substitution
,editDistance'del sub ins(V.init s1)s2 + ins - insertion
]

列表O(n)的操作包括init,长度,以及最后(尽管init至少可以是懒惰的)。所有这些操作都是使用向量的O(1)。

虽然真正的基准测试应该使用 Criterion ,一个快速而肮脏的基准:

  str2 = replicate 15'a'++ replicate 25'b '
str1 = replicate 20'a'++ replicate 20'b'
main = print $ editDistance str1 str2

显示矢量版本需要0.09秒,而字符串需要1.6秒,所以我们甚至没有看到 editDistance 算法而保存了大约一个数量级。

现在如何记忆结果?



更大的问题显然是需要记忆。我将此作为学习 monad-memo 软件包的机会 - 我的上帝真的太棒了!对于一个额外的约束条件(您需要 Ord a ),您基本上可以毫不费力地进行记忆。代码:

 将限定的Data.Vector导入为V 
导入Control.Monad.Memo

- 两个列表之间的标准levenshtein距离
editDistance ::(等式a,等式a)=> [a] - > [a] - > Int
editDistance s1 s2 = startEvalMemo $ editDistance'(1,1,1,(V.fromList s1),(V.fromList s2))

- 加权levenshtein距离
- ins,sub和del是各种操作的成本
editDistance'::(MonadMemo(Int,Int,Int,V.Vector a,V.Vector a)Int m,Eq a)=> ; (Int,Int,Int,V.Vector a,V.Vector a) - > m Int
editDistance'(del,sub,ins,s1,s2)
| V.null s2 =返回$ ins * V.length s1
| V.null s1 =返回$ ins * V.length s2
| V.last s1 == V.last s2 = memo editDistance'(del,sub,ins,(V.init s1),(V.init s2))
| (del,sub,ins,s1,(V.init s2))
r2 < - memo editDistance'(del,sub,ins,(V .init s1),(V.init s2))
r3 < - memo editDistance'(del,sub,ins,(V.init s1),s2)
return $ minimum [r1 + del - 删除
,r2 +子替换
,r3 + ins - 插入
]

你看到memoization如何需要一个key(请参阅​​MonadMemo类)?我将所有参数打包成一个很大的丑陋元组。它还需要一个价值,这是你的结果 Int 。然后使用memo函数为要记忆的值插入和播放。

/ p>

  $ time ./so#记忆矢量版本
12

real 0m0.003s

$ time ./so3#未记忆的矢量版本
12

真实1m33.122s

甚至不要考虑运行非memoized字符串版本,我认为它至少需要大约15分钟。至于我,我现在喜欢monad-memo - 感谢Eduard包!

编辑:字符串和 Vector 在memoized版本中并没有那么多,但当距离达到200时仍然增长到2倍,所以仍然值得。



编辑:也许我应该解释为什么更大的问题是显然记忆结果。那么,如果你看看原始算法的核心:

  [editDistance'... s1(V.init s2) + del 
,editDistance'...(V.init s1)(V.init s2)+ sub
,editDistance'...(V.init s1)s2 + ins]

很明显,调用 editDistance的s1 s2 结果3次调用 editDistance' ...其中每个调用 editDistance'三次以上...还有三次......和AHHH!指数爆炸!幸运的是,大多数电话是相同的!例如(使用 - > 作为calls, eD 作为 editDistance' / code>):

  eD s1 s2  - > eD s1(init s2) - 父节点
,eD(init s1)s2
,eD(init s1)(init s2)
eD(init s1)s2 - > eD(init s1)(init s2) - 第一个子
,eD(init(init s1))s2
,eD(init(init s1))(init s2)
eD s1(init s2) - > eD s1(init(init s2))
,eD(init s1)(init s2)
,eD(init s1)(init(init s2))

仅仅考虑父类和两个直接子类,我们可以看到 ed(ini​​t s1)(init s2) code>完成三次。另一个孩子与父母共享呼叫,并且所有孩子都与他人共享许多呼叫(以及他们的孩子,提示Monty Python skit)。



这将是一种乐趣,也许有启发性,可以做一个 runMemo 类似的函数,它返回使用的缓存结果的数量。


I'm trying to implement the levenshtein distance (or edit distance) in Haskell, but its performance decreases rapidly when the string lenght increases.

I'm still quite new to Haskell, so it would be nice if you could give me some advice on how I could improve the algorithm. I already tried to "precompute" values (the inits), but since it didn't change anything I reverted that change.

I know there's already an editDistance implementation on Hackage, but I need it to work on lists of arbitrary tokens, not necessarily strings. Also, I find it a bit complicated, at least compared to my version.

So, here's the code:

-- standard levenshtein distance between two lists
editDistance      :: Eq a => [a] -> [a] -> Int
editDistance s1 s2 = editDistance' 1 1 1 s1 s2 

-- weighted levenshtein distance
-- ins, sub and del are the costs for the various operations
editDistance'      :: Eq a => Int -> Int -> Int -> [a] -> [a] -> Int
editDistance' _ _ ins s1 [] = ins * length s1 
editDistance' _ _ ins [] s2 = ins * length s2 
editDistance' del sub ins s1 s2  
    | last s1 == last s2 = editDistance' del sub ins (init s1) (init s2)
    | otherwise          = minimum [ editDistance' del sub ins s1 (init s2)        + del -- deletion 
                                   , editDistance' del sub ins (init s1) (init s2) + sub -- substitution
                                   , editDistance' del sub ins (init s1) s2        + ins -- insertion
                                   ]

It seems to be a correct implementation, at least it gives exactly the same results as this online tool.

Thanks in advance for your help! If you need any additional information, please let me know.

Greetings, bzn

解决方案

Ignoring that this is a bad algorithm (should be memoizing, I get to that second)...

Use O(1) Primitives and not O(n)

One problem is you use a whole bunch calls that are O(n) for lists (haskell lists are singly linked lists). A better data structure would give you O(1) operations, I used Vector:

import qualified Data.Vector as V

-- standard levenshtein distance between two lists
editDistance      :: Eq a => [a] -> [a] -> Int
editDistance s1 s2 = editDistance' 1 1 1 (V.fromList s1) (V.fromList s2)

-- weighted levenshtein distance
-- ins, sub and del are the costs for the various operations
editDistance'      :: Eq a => Int -> Int -> Int -> V.Vector a -> V.Vector a -> Int
editDistance' del sub ins s1 s2
  | V.null s2 = ins * V.length s1
  | V.null s1 = ins * V.length s2
  | V.last s1 == V.last s2 = editDistance' del sub ins (V.init s1) (V.init s2)
  | otherwise            = minimum [ editDistance' del sub ins s1 (V.init s2)        + del -- deletion 
                                   , editDistance' del sub ins (V.init s1) (V.init s2) + sub -- substitution
                                   , editDistance' del sub ins (V.init s1) s2        + ins -- insertion
                                   ]

The operations that are O(n) for lists include init, length, and last (though init is able to be lazy at least). All these operations are O(1) using Vector.

While real benchmarking should use Criterion, a quick and dirty benchmark:

str2 = replicate 15 'a' ++ replicate 25 'b'
str1 = replicate 20 'a' ++ replicate 20 'b'
main = print $ editDistance str1 str2

shows the vector version takes 0.09 seconds while strings take 1.6 seconds, so we saved about an order of magnitude without even looking at your editDistance algorithm.

Now what about memoizing results?

The bigger issue is obviously the need for memoization. I took this as an opportunity to learn the monad-memo package - my god is that awesome! For one extra constraint (you need Ord a), you get a memoization basically for no effort. The code:

import qualified Data.Vector as V
import Control.Monad.Memo

-- standard levenshtein distance between two lists
editDistance      :: (Eq a, Ord a) => [a] -> [a] -> Int
editDistance s1 s2 = startEvalMemo $ editDistance' (1, 1, 1, (V.fromList s1), (V.fromList s2))

-- weighted levenshtein distance
-- ins, sub and del are the costs for the various operations
editDistance' :: (MonadMemo (Int, Int, Int, V.Vector a, V.Vector a) Int m, Eq a) => (Int, Int, Int, V.Vector a, V.Vector a) -> m Int
editDistance' (del, sub, ins, s1, s2)
  | V.null s2 = return $ ins * V.length s1
  | V.null s1 = return $ ins * V.length s2
  | V.last s1 == V.last s2 = memo editDistance' (del, sub, ins, (V.init s1), (V.init s2))
  | otherwise = do
        r1 <- memo editDistance' (del, sub, ins, s1, (V.init s2))
        r2 <- memo editDistance' (del, sub, ins, (V.init s1), (V.init s2))
        r3 <- memo editDistance' (del, sub, ins, (V.init s1), s2)
        return $ minimum [ r1 + del -- deletion 
                         , r2 + sub -- substitution
                         , r3 + ins -- insertion
                                   ]

You see how the memoization needs a single "key" (see the MonadMemo class)? I packaged all the arguments as a big ugly tuple. It also needs one "value", which is your resulting Int. Then it's just plug and play using the "memo" function for the values you want to memoize.

For benchmarking I used a shorter, but larger-distance, string:

$ time ./so  # the memoized vector version
12

real    0m0.003s

$ time ./so3  # the non-memoized vector version
12

real    1m33.122s

Don't even think about running the non-memoized string version, I figure it would take around 15 minutes at a minimum. As for me, I now love monad-memo - thanks for the package Eduard!

EDIT: The difference between String and Vector isn't as much in the memoized version, but still grows to a factor of 2 when the distance gets to around 200, so still worth while.

EDIT: Perhaps I should explain why the bigger issue is "obviously" memoizing results. Well, if you look at the heart of the original algorithm:

 [ editDistance' ... s1          (V.init s2)  + del 
 , editDistance' ... (V.init s1) (V.init s2) + sub
 , editDistance' ... (V.init s1) s2          + ins]

It's quite clear a call of editDistance' s1 s2 results in 3 calls to editDistance'... each of which call editDistance' three more times... and three more time... and AHHH! Exponential explosion! Luckly most the calls are identical! for example (using --> for "calls" and eD for editDistance'):

eD s1 s2  --> eD s1 (init s2)             -- The parent
            , eD (init s1) s2
            , eD (init s1) (init s2)
eD (init s1) s2 --> eD (init s1) (init s2)         -- The first "child"
                  , eD (init (init s1)) s2
                  , eD (init (init s1)) (init s2) 
eD s1 (init s2) --> eD s1 (init (init s2))
                  , eD (init s1) (init s2)
                  , eD (init s1) (init (init s2))

Just by considering the parent and two immediate children we can see the call ed (init s1) (init s2) is done three times. The other child share calls with the parent too and all children share many calls with each other (and their children, cue Monty Python skit).

It would be a fun, perhaps instructive, exercise to make a runMemo like function that returns the number of cached results used.

这篇关于编辑Haskell中的距离算法 - 性能调整的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆