优化Haskell代码 [英] Optimizing Haskell code

查看：86 发布时间：2018/6/4 15:14:22 performance optimization haskell

本文介绍了优化Haskell代码的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我试图学习Haskell，在关于马尔可夫文本链的reddit的一篇文章中，我决定首先在Python中实现Markov文本生成，然后在Haskell中实现。但是我注意到我的python实现比Haskell版本快，即使Haskell编译为本地代码。我想知道我应该怎么做才能让Haskell代码运行得更快，现在我相信由于使用Data.Map而不是hashmaps，所以速度非常慢，但我不确定

I'm trying to learn Haskell and after an article in reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However I noticed that my python implementation is way faster than the Haskell version, even Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster and for now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure

我将发布Python代码和Haskell。使用相同的数据，Python需要大约3秒，Haskell接近16秒。

I'll post the Python code and Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.

不用说我会采取任何建设性的批评：）。 p>

It comes without saying that I'll take any constructive criticism :).

import random
import re
import cPickle
class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] is not ".":
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])

--

module Markov ( train , fox ) where import Debug.Trace import qualified Data.Map as M import qualified System.Random as R import qualified Data.ByteString.Char8 as B type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) train :: [B.ByteString] -> Database train (x:y:[]) = M.empty train (x:y:z:xs) = let l = train (y:z:xs) in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l main = do contents <- B.readFile "76.txt" print $ train $ B.words contents fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

推荐答案

我尽量避免做任何奇特或微妙的事情。这只是两种方法来进行分组;第一个强调模式匹配，第二个不强调。

I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.

import Data.List (foldl') import qualified Data.Map as M import qualified Data.ByteString.Char8 as B type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) train2 :: [B.ByteString] -> Database2 train2 words = go words M.empty where go (x:y:[]) m = m go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1 addWord (Just m') = Just $ M.alter inc z m' inc Nothing = Just 1 inc (Just cnt) = Just $ cnt + 1 in go (y:z:xs) $ M.alter addWord (x,y) m train3 :: [B.ByteString] -> Database2 train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words)) where update m (x,y,z) = M.alter (addWord z) (x,y) m addWord word = Just . maybe (M.singleton word 1) (M.alter inc word) inc = Just . maybe 1 (+1) main = do contents <- B.readFile "76.txt" let db = train3 $ B.words contents print $ "Built a DB of " ++ show (M.size db) ++ " words"

我想他们都比原始版本快，但我承认，我只是试图对付我发现的第一个合理语料库。

I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.

编辑
根据Travis Brown的非常有效的观点，

EDIT As per Travis Brown's very valid point,

train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
          inc k _ = M.insertWith (+) k 1

这篇关于优化Haskell代码的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

优化Haskell代码 [英] Optimizing Haskell code

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

优化Haskell代码 [英] Optimizing Haskell code

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭