是否有可能在 Haskell 中加速快速排序? [英] Is it possible to speed up a quicksort with par in Haskell?

查看:24
本文介绍了是否有可能在 Haskell 中加速快速排序?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我得到了这个看似微不足道的并行快速排序实现,代码如下:

I have got this seemingly trivial parallel quicksort implementation, the code is as follows:

import System.Random
import Control.Parallel
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort xs = pQuicksort 16 xs -- 16 is the number of sparks used to sort

-- pQuicksort, parallelQuicksort  
-- As long as n > 0 evaluates the lower and upper part of the list in parallel,
-- when we have recursed deep enough, n==0, this turns into a serial quicksort.
pQuicksort :: Ord a => Int -> [a] -> [a]
pQuicksort _ [] = []
pQuicksort 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in pQuicksort 0 lower ++ [x] ++ pQuicksort 0 upper
pQuicksort n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksort (n `div` 2) lower
      u = [x] ++ pQuicksort (n `div` 2) upper
  in (par u l) ++ u

main :: IO ()
main = do
  gen <- getStdGen
  let randints = (take 5000000) $ randoms gen :: [Int]
  putStrLn . show . sum $ (quicksort randints)

我编译

ghc --make -threaded -O2 quicksort.hs

并运行

./quicksort +RTS -N16 -RTS

无论我做什么,我都无法让它比在一个 CPU 上运行的简单顺序实现运行得更快.

No matter what I do I can not get this to run faster than a simple sequential implementation running on one cpu.

  1. 是否可以解释为什么在多个 CPU 上的运行速度比在一个 CPU 上慢得多?
  2. 是否可以通过一些技巧使这种规模(至少是次线性)与 CPU 数量成比例?

@tempestadept 暗示快速排序是问题所在.为了检查这一点,我本着与上面的示例相同的精神实现了一个简单的合并排序.它具有相同的行为,添加的功能越多,执行速度越慢.

@tempestadept hinted that quick sort it self is the problem. To check this I implemented a simple merge sort in the same spirit as the example above. It has the same behaviour, performs slower the more capabilities you add.

import System.Random
import Control.Parallel

splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where helper _ left right [] = (left, right)
        helper True  left right (x:xs) = helper False (x:left) right xs
        helper False left right (x:xs) = helper True  left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) = case x<y of
  True  -> x : merge xs (y:ys)
  False -> y : merge (x:xs) ys

mergeSort :: (Ord a) => [a] -> [a]
mergeSort xs = pMergeSort 16 xs -- we use 16 sparks

-- pMergeSort, parallel merge sort. Takes an extra argument
-- telling how many sparks to create. In our simple test it is
-- set to 16
pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (r `par` l) `pseq` (merge l r)

ris :: Int -> IO [Int]
ris n = do
  gen <- getStdGen
  return . (take n) $ randoms gen

main = do
  r <- ris 100000
  putStrLn . show . sum $ mergeSort r

推荐答案

鉴于@lehins 的出色回答,我不确定这是否值得注意,但是...

I'm not sure this is worth noting, given @lehins excellent answer, but...

您的 pQuickSort 有两个大问题.第一个是您正在使用 System.Random,它很慢并且与并行排序奇怪地交互(见下文).第二个是你的 par ul 引发了一个计算来评估:

There are two big problems with your pQuickSort. The first is that you're using System.Random, which is bog slow and interacts strangely with a parallel sort (see below). The second is that your par u l sparks a computation to evaluate:

u = [x] ++ pQuicksort (n `div` 2) upper

到 WHNF,即 u = x : UNEVALUATED_THUNK,所以你的火花没有做任何实际工作.

to WHNF, namely u = x : UNEVALUATED_THUNK, so your sparks aren't doing any real work.

事实上,当并行化一个简单的、非原位的、伪快速排序时,不难观察到性能的提高.如前所述,一个重要的考虑因素是避免使用 System.Random.使用快速 LCG,我们可以对实际排序时间进行基准测试,而不是排序和随机数生成的某种奇怪混合.以下伪快速排序:

In fact, it's not difficult to observe a performance improvement when parallelizing a naive, not-in-place, pseudo-quicksort. As mentioned, an important consideration is to avoid using System.Random. With a fast LCG, we can benchmark the actual sort time, rather than some weird mixture of sort and random number generation. The following pseudo-quicksort:

import Data.List

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let (a,b) = partition (<=x) xs
    in qsort a ++ x:qsort b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  print . sum $ qsort randints

使用 GHC 8.6.4 和 -O2 编译时,运行时间约为 9.7 秒.以下并行化"版本:

when compiled with GHC 8.6.4 and -O2, runs in about 9.7 seconds. The following "parallelized" version:

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let (a,b) = partition (<=x) xs
        a' = qsort a
        b' = qsort b
    in (b' `par` a') ++ x:b'
qsort [] = []

使用 ghc -O2 -threaded 编译,在一项功能上运行大约 11.0 秒.添加+RTS -N4,运行7.1秒.

compiled with ghc -O2 -threaded runs in about 11.0 seconds on one capability. Add +RTS -N4, and it runs in 7.1 seconds.

达达!改进.

(相比之下,带有 System.Random 的版本对于非并行版本的运行时间约为 13 秒,对于一种功能的并行版本运行时间约为 12 秒(可能只是因为一些小的严格改进),并且随着每增加一个附加功能而显着减慢;时间也是不稳定的,虽然我不太确定为什么.)

(In contrast, the version with System.Random runs in about 13 seconds for the non-parallel version, about 12 seconds for the parallel version on one capability (probably just because of some minor strictness improvement), and slows down substantially for each additional capability added; timings are erratic, too, though I'm not quite sure why.)

这个版本的一个明显问题是,即使 a' = qsort ab' = qsort b 并行运行,它们也绑定到相同的顺序 partition 调用.通过将其分为两个过滤器:

One obvious problem with this version is that, even with a' = qsort a and b' = qsort b running in parallel, they're tied to the same sequential partition call. By dividing this up into two filters:

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let a = qsort $ filter (<=x) xs
        b = qsort $ filter (>x)  xs
    in b `par` a ++ x:b
qsort [] = []

我们使用 -N4 将速度加快了大约 5.5 秒.公平地说,即使 非并行 版本实际上用两个 filters 代替 partition 调用实际上稍微快一点,至少在排序时 <代码>整数.与分区相比,过滤器可能还有一些额外的优化是可能的,这使得额外的比较值得.

we speed things up to about 5.5 seconds with -N4. To be fair, even the non-parallel version is actually slightly faster with two filters in place of the partition call, at least when sorting Ints. There are probably some additional optimizations that are possible with the filters compared to the partition that make the extra comparisons worth it.

现在,您在上面的 pQuickSort 中尝试做的是将并行计算限制为最顶层的递归集.让我们使用以下 psort 来试验一下:

Now, what you tried to do in pQuickSort above was to limit the parallel computations to the top-most set of recursions. Let's use the following psort to experiment with this:

psort :: Ord a => Int -> [a] -> [a]
psort n (x:xs)
  = let a = psort (n-1) $ filter (<=x) xs
        b = psort (n-1) $ filter (>x)  xs
    in if n > 0 then b `par` a ++ x:b else a ++ x:b
psort _ [] = []

这将并行化递归的顶部 n 层.我的特定 LCG 示例的种子为 1(即 iterate lcg 1)递归多达 54 层,因此 psort 55 应该提供与完全并行版本相同的性能,除了用于跟踪图层的开销.当我运行它时,我用 -N4 获得了大约 5.8 秒的时间,所以开销非常小.

This will parallelize the top n layers of the recursion. My particular LCG example with a seed of 1 (i.e., iterate lcg 1) recurses up to 54 layers, so psort 55 should give the same performance as the fully parallel version except for the overhead of keeping track of layers. When I run it, I get a time of about 5.8 seconds with -N4, so the overhead is quite small.

现在,看看当我们减少层数时会发生什么:

Now, look what happens as we reduce the number of layers:

| Layers |  55 |  40 |  30 |  20 |  10 |   5 |   3 |    1 |
|--------+-----+-----+-----+-----+-----+-----+-----+------|
| time   | 5.5 | 5.6 | 5.7 | 5.4 | 7.0 | 8.9 | 9.8 | 10.2 |

请注意,在最低层,并行计算几乎没有什么好处.这主要是因为树的平均深度可能大约在 25 层左右,所以只有少数 50 层的计算,其中许多具有奇怪的、不对称的分区,而且它们肯定也是小到并行化.另一方面,那些额外的 par 调用似乎没有任何惩罚.

Note that, at the lowest layers, there's little to be gained from parallel computation. This is mostly because the average depth of the tree is probably around 25 layers or so, so there's only a handful of computations at 50 layers many with weird, lop-sided partitions, and they're certainly too small to parallelize. On the flip side, there doesn't seem to be any penalty for those extra par calls.

与此同时,增益一直增加到至少 20 层,因此试图人为地将火花的总数限制为 16(例如,前 4 或 5 层)是一个很大的损失.

Meanwhile, there are increasing gains all the way down to at least 20 layers, so trying to artificially limit the total number of sparks to 16 (e.g., top 4 or 5 layers), is a big loss.

这篇关于是否有可能在 Haskell 中加速快速排序?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆