C#/.NET - Performance degrading with minimally-parallelized Quicksort

Question

I am currently working on a recursively parallel Quicksort extension function for the List class. The code below represents the most basic thread distribution criterion I've considered, since it should be the simplest to explain conceptually. It branches to a depth equal to the base-2 logarithm of the number of detected processors, and proceeds sequentially from there. Thus, each CPU should get one thread with a (roughly) equal, large share of data to process, avoiding excessive overhead. The basic sequential algorithm is provided for comparison.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Quicksort
{
    /// <summary>
    /// Helper class to hold information about when to parallelize
    /// </summary>
    /// <attribute name="maxThreads">Maximum number of supported threads</attribute>
    /// <attribute name="threadDepth">The depth to which new threads should
    ///                               automatically be made</attribute>
    private class ThreadInfo
    {
        internal int maxThreads;
        internal int threadDepth;

        public ThreadInfo(int length)
        {
            maxThreads = Environment.ProcessorCount;
            threadDepth = (int)Math.Log(maxThreads, 2);
        }

    }

    /// <summary>
    /// Helper function to perform the partitioning step of quicksort
    /// </summary>
    /// <param name="list">The list to partition</param>
    /// <param name="start">The starting index</param>
    /// <param name="end">The ending index</param>
    /// <returns>The final index of the pivot</returns>
    public static int Partition<T>(this List<T> list, int start, int end) where T: IComparable
    {
        int middle = start + (end - start) / 2;  // overflow-safe midpoint

        // Swap pivot and first item.
        var temp = list[start];
        list[start] = list[middle];
        list[middle] = temp;
        var pivot = list[start];

        var swapPtr = start;

        for (var cursor = start + 1; cursor <= end; cursor++)
        {
            if (list[cursor].CompareTo(pivot) < 0)
            {
                // Swap cursor element and designated swap element
                temp = list[cursor];
                list[cursor] = list[++swapPtr];
                list[swapPtr] = temp;
            }
        }

        // Swap pivot with final lower item
        temp = list[start];
        list[start] = list[swapPtr];
        list[swapPtr] = temp;

        return swapPtr;
    }

    /// <summary>
    /// Method to begin parallel quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortParallel<T>(this List<T> list) where T : IComparable
    {
        if (list.Count < 2048)
            list.QuicksortSequential();
        else
        {
            var info = new ThreadInfo(list.Count);
            list.QuicksortRecurseP(0, list.Count - 1, 0, info);
        }
    }

    /// <summary>
    /// Method to implement parallel quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    /// <param name="depth">The current recursive depth</param>
    /// <param name="info">Structure holding decision-making info for threads</param>
    private static void QuicksortRecurseP<T>(this List<T> list, int start, int end, int depth,
                                             ThreadInfo info)
                                             where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);

        if (depth < info.threadDepth)
        {
            var t = Task.Run(() =>
            {
                list.QuicksortRecurseP(start, middle - 1, depth + 1, info);
            });

            list.QuicksortRecurseP(middle + 1, end, depth + 1, info);
            t.Wait();
        }
        else
        {
            list.QuicksortRecurseS(start, middle - 1);
            list.QuicksortRecurseS(middle + 1, end);
        }
    }

    /// <summary>
    /// Method to begin sequential quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortSequential<T>(this List<T> list) where T : IComparable
    {
        list.QuicksortRecurseS(0, list.Count - 1);
    }

    /// <summary>
    /// Method to implement sequential quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    private static void QuicksortRecurseS<T>(this List<T> list, int start, int end) where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);

        // Now recursively sort the (approximate) halves.
        list.QuicksortRecurseS(start, middle - 1);
        list.QuicksortRecurseS(middle + 1, end);
    }
}
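For reference, here is a minimal driver showing how the extension methods above would be invoked; the random data and the verification loop are mine, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        // Build a list of random integers (int implements IComparable).
        var rng = new Random(42);
        var data = Enumerable.Range(0, 100_000)
                             .Select(_ => rng.Next())
                             .ToList();

        // Sort using the parallel extension method defined above.
        data.QuicksortParallel();

        // Verify the result is in non-decreasing order.
        for (int i = 1; i < data.Count; i++)
            if (data[i - 1].CompareTo(data[i]) > 0)
                throw new Exception("List is not sorted");
        Console.WriteLine("Sorted OK");
    }
}
```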

As far as I understand, this methodology should incur a one-time startup cost, then sort the rest of the data significantly faster than the sequential method. However, the parallel method takes significantly more time than the sequential method, and the gap actually widens as the load increases. Benchmarked on a list of ten million items on a 4-core CPU, the sequential method averages around 18 seconds to run to completion, while the parallel method takes upwards of 26 seconds. Increasing the allowed thread depth rapidly exacerbates the problem.
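The timings above could be reproduced with a simple Stopwatch harness along these lines; this is a sketch (the asker's actual harness is not shown), and it assumes the Quicksort extension class above is compiled into the same project.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Benchmark
{
    static void Main()
    {
        const int n = 10_000_000;
        var rng = new Random(0);
        var source = Enumerable.Range(0, n).Select(_ => rng.Next()).ToList();

        // Time the sequential sort on a fresh copy of the data.
        var seq = new List<int>(source);
        var sw = Stopwatch.StartNew();
        seq.QuicksortSequential();
        sw.Stop();
        Console.WriteLine($"Sequential: {sw.Elapsed.TotalSeconds:F2} s");

        // Time the parallel sort on an identical copy.
        var par = new List<int>(source);
        sw.Restart();
        par.QuicksortParallel();
        sw.Stop();
        Console.WriteLine($"Parallel:   {sw.Elapsed.TotalSeconds:F2} s");
    }
}
```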

Any help in finding the performance hog is appreciated. Thanks!

Answer

The problem is CPU cache conflict, also known as "false sharing."

Unless the memory address of the pivot point happens to fall on a cache-line boundary, one of the threads will get a lock on the L1 or L2 cache and the other one will have to wait. This can make performance even worse than a serial solution. The problem is described well in this article:

...where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. [1,2] This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.

(snip)

In most cases, the parallel code actually ran slower than the sequential code, and in no case did we get any better than a 42% speedup no matter how many cores we threw at the problem.

Grasping at straws.... you may be able to do better if you separate the list into two objects, and pin them in memory (or even just pad them within a struct) so they are far enough apart that they won't have a cache conflict.
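As a rough illustration of the padding idea (a standalone demo of false sharing, not a fix for the quicksort itself), two counters can be forced onto separate cache lines with explicit struct layout. The 64-byte cache-line size and the specific field offsets are assumptions about typical x86 hardware.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Each field is placed 64 bytes apart so it occupies its own cache line.
[StructLayout(LayoutKind.Explicit, Size = 192)]
struct PaddedPair
{
    [FieldOffset(64)]  public long A;   // own cache line
    [FieldOffset(128)] public long B;   // own cache line
}

class FalseSharingDemo
{
    static long[] shared = new long[2];  // adjacent longs: likely the same line
    static PaddedPair padded;

    static void Main()
    {
        const long iters = 100_000_000;

        // Two threads hammering adjacent memory: cache line ping-pongs.
        var sw = Stopwatch.StartNew();
        Task.WaitAll(
            Task.Run(() => { for (long i = 0; i < iters; i++) shared[0]++; }),
            Task.Run(() => { for (long i = 0; i < iters; i++) shared[1]++; }));
        Console.WriteLine($"Adjacent: {sw.ElapsedMilliseconds} ms");

        // Same work on padded fields: each core keeps its own line.
        sw.Restart();
        Task.WaitAll(
            Task.Run(() => { for (long i = 0; i < iters; i++) padded.A++; }),
            Task.Run(() => { for (long i = 0; i < iters; i++) padded.B++; }));
        Console.WriteLine($"Padded:   {sw.ElapsedMilliseconds} ms");
    }
}
```

On typical hardware the padded version runs noticeably faster, though the exact ratio depends on the CPU.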
