iOS Concurrency - Not reaching anywhere near theoretical maximum


Problem description

I'm new to Grand Central Dispatch and have been running some tests with it doing some processing on an image. Basically I'm running a grayscale algorithm both sequentially and using GCD and comparing the results.

Here is the basic loop:

UInt8 r,g,b;
uint pixelIndex;
for (uint y = 0; y < height; y++) {
    for (uint x = 0; x < width; x++) {
        pixelIndex = (uint)(y * width + x);

        if (pixelIndex+2 < width * height) {
            sourceDataPtr = &sourceData[pixelIndex];

            r = sourceDataPtr[0+0];
            g = sourceDataPtr[0+1];
            b = sourceDataPtr[0+2];

            int value = (r+g+b) / 3;
            if (value > MAX_COLOR_VALUE) {
                value = MAX_COLOR_VALUE;
            }

            targetData[pixelIndex] = value;
            self.imageData[pixelIndex] = value;
        }
    }
}

It simply runs through and takes the average of the red, green and blue values and uses that for the gray value. Very simple. Now the parallel version basically breaks the image into portions and then computes those portions separately, namely 2, 4, 8, 16 & 32 portions. I'm using basic GCD, so I pass each portion in as its own block to run concurrently. Here is the GCD-wrapped code:

dispatch_group_t myTasks = dispatch_group_create();

for (int startX = 0; startX < width; startX += width/self.numHorizontalSegments) {
    for (int startY = 0; startY < height; startY += height/self.numVerticalSegments) {
        // For each segment, enqueue a block of code to compute it.
        dispatch_group_async(myTasks, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
             // grayscale code...
        });
    }
}
dispatch_group_wait(myTasks, DISPATCH_TIME_FOREVER); 

Everything is working fine. But what I am not understanding is the speedup / CPU usage. Running tests in the simulator (which is using my dual core CPU) I am getting:

  • ~0.0945s run time sequentially
  • ~0.0675s run time using GCD

This is a speedup of around ~28% (i.e. the GCD version takes about 72% of the sequential version's time, a speedup factor of roughly 1.4×). Theoretically, a 100% speedup (2×) is the maximum on a 2-core machine, so this falls well short of that and I can't figure out why.

I monitor the CPU usage and it maxes out around 118% - why is it not reaching closer to 200%? If anyone has an idea as to what I should change, or what is the culprit here I would greatly appreciate it.

My Theories:

  • Not enough work on CPU (but image is ~3,150,000 pixels)
  • Not enough time to fire up to near 200%? Maybe each thread requires a longer runtime before it starts chewing up that much of the CPU?
  • I thought maybe the overhead was pretty high, but a test of launching 32 empty blocks to a queue (also in a group) took around ~0.0005s maximum.
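
For reference, here is a rough sketch of the kind of empty-block overhead test I mean (an assumed reconstruction under the same setup as my grayscale test, not the exact code I ran):

dispatch_group_t overheadGroup = dispatch_group_create();
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);

CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();
for (int i = 0; i < 32; i++) {
    dispatch_group_async(overheadGroup, queue, ^{
        // intentionally empty; only the dispatch overhead is being measured
    });
}
dispatch_group_wait(overheadGroup, DISPATCH_TIME_FOREVER);
NSLog(@"32 empty blocks: %f s", CFAbsoluteTimeGetCurrent() - start);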

Solution

In my tests, I found that if I just focused on the concurrent B&W conversion, I achieved something close to the "twice the speed" that you were expecting (the parallel rendition took 53% as long as the serial rendition). When I also included the ancillary portions of the conversion (not only the conversion, but also the retrieval of the image, preparation of the output pixel buffer, and creation of the new image, etc.), then the resulting performance improvement was less spectacular, where elapsed time was 79% as long as the serial rendition.
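
As a rough back-of-the-envelope check (my own Amdahl's-law estimate from those two figures, not a separate measurement): if the parallelizable conversion accounts for a fraction p of the total elapsed time, and that part drops to about 53% of its serial time, then

    (1 - p) + 0.53 × p ≈ 0.79   →   p ≈ 0.45

In other words, only about half of the end-to-end time was in the portion that could be parallelized at all, which is why the overall improvement looks modest.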

In terms of why you might not achieve an absolute doubling of performance, even if you just focus on the portion that can enjoy concurrency, Apple attributes this behavior to the overhead of scheduling code for execution. In their discussion of dispatch_apply in "Performing Loop Iterations Concurrently" in the Concurrency Programming Guide, they contemplate the balance between the performance gain of concurrent tasks and the overhead that each dispatched block entails:

You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25. For an example of how to implement striding, see "Improving on Loop Code."

As an aside, I think it might be worth considering creating your own concurrent queue and using dispatch_apply. It is designed for precisely this purpose: optimizing for loops that can enjoy concurrency.
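
To make both points concrete, striding and running dispatch_apply on your own concurrent queue, here is a minimal, hypothetical sketch using the "100 iterations, stride of 4" numbers from the quote (the queue label and the loop body are placeholders, not code from the guide):

dispatch_queue_t queue = dispatch_queue_create("com.example.stride", DISPATCH_QUEUE_CONCURRENT);

size_t iterations = 100;   // original loop count
size_t stride     = 4;     // original iterations grouped into each block

// dispatch_apply runs 100 / 4 = 25 blocks; each block performs 4 of the original iterations.
dispatch_apply(iterations / stride, queue, ^(size_t idx) {
    size_t start = idx * stride;
    size_t stop  = start + stride;
    for (size_t i = start; i < stop; i++) {
        // ... work for original loop iteration i ...
    }
});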


Here is my code that I used for my benchmarking:

- (UIImage *)convertImage:(UIImage *)image algorithm:(NSString *)algorithm
{
    CGImageRef imageRef = image.CGImage;
    NSAssert(imageRef, @"Unable to get CGImageRef");

    CGDataProviderRef provider = CGImageGetDataProvider(imageRef);
    NSAssert(provider, @"Unable to get provider");

    NSData *data = CFBridgingRelease(CGDataProviderCopyData(provider));
    NSAssert(data, @"Unable to copy image data");

    NSInteger       bitsPerComponent = CGImageGetBitsPerComponent(imageRef);
    NSInteger       bitsPerPixel     = CGImageGetBitsPerPixel(imageRef);
    CGBitmapInfo    bitmapInfo       = CGImageGetBitmapInfo(imageRef);
    NSInteger       bytesPerRow      = CGImageGetBytesPerRow(imageRef);
    NSInteger       width            = CGImageGetWidth(imageRef);
    NSInteger       height           = CGImageGetHeight(imageRef);
    CGColorSpaceRef colorspace       = CGImageGetColorSpace(imageRef);

    void *outputBuffer = malloc(width * height * bitsPerPixel / 8);
    NSAssert(outputBuffer, @"Unable to allocate buffer");

    uint8_t *buffer = (uint8_t *)[data bytes];

    CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();

    if ([algorithm isEqualToString:kImageAlgorithmSimple]) {
        [self convertToBWSimpleFromBuffer:buffer toBuffer:outputBuffer width:width height:height];
    } else if ([algorithm isEqualToString:kImageAlgorithmDispatchApply]) {
        [self convertToBWConcurrentFromBuffer:buffer toBuffer:outputBuffer width:width height:height count:2];
    } else if ([algorithm isEqualToString:kImageAlgorithmDispatchApply4]) {
        [self convertToBWConcurrentFromBuffer:buffer toBuffer:outputBuffer width:width height:height count:4];
    } else if ([algorithm isEqualToString:kImageAlgorithmDispatchApply8]) {
        [self convertToBWConcurrentFromBuffer:buffer toBuffer:outputBuffer width:width height:height count:8];
    }

    NSLog(@"%@: %.2f", algorithm, CFAbsoluteTimeGetCurrent() - start);

    // Pass the actual buffer size; sizeof(outputBuffer) would only be the size of the pointer.
    CGDataProviderRef outputProvider = CGDataProviderCreateWithData(NULL, outputBuffer, width * height * bitsPerPixel / 8, releaseData);

    CGImageRef outputImageRef = CGImageCreate(width,
                                              height,
                                              bitsPerComponent,
                                              bitsPerPixel,
                                              bytesPerRow,
                                              colorspace,
                                              bitmapInfo,
                                              outputProvider,
                                              NULL,
                                              NO,
                                              kCGRenderingIntentDefault);

    UIImage *outputImage = [UIImage imageWithCGImage:outputImageRef];

    CGImageRelease(outputImageRef);
    CGDataProviderRelease(outputProvider);

    return outputImage;
}

/** Convert the image to B&W as a single (non-parallel) task.
 *
 * This assumes the pixel buffer is in RGBA, 8 bits per pixel format.
 *
 * @param inputBuffer  The input pixel buffer.
 * @param outputBuffer The output pixel buffer.
 * @param width        The image width in pixels.
 * @param height       The image height in pixels.
 */
- (void)convertToBWSimpleFromBuffer:(uint8_t *)inputBuffer toBuffer:(uint8_t *)outputBuffer width:(NSInteger)width height:(NSInteger)height
{
    for (NSInteger row = 0; row < height; row++) {

        for (NSInteger col = 0; col < width; col++) {

            NSUInteger offset = (col + row * width) * 4;
            uint8_t *rgba = inputBuffer + offset;

            uint8_t red   = rgba[0];
            uint8_t green = rgba[1];
            uint8_t blue  = rgba[2];
            uint8_t alpha = rgba[3];

            uint8_t gray = 0.2126 * red + 0.7152 * green + 0.0722 * blue;

            outputBuffer[offset]     = gray;
            outputBuffer[offset + 1] = gray;
            outputBuffer[offset + 2] = gray;
            outputBuffer[offset + 3] = alpha;
        }
    }
}

/** Convert the image to B&W, using GCD to split the conversion into several concurrent GCD tasks.
 *
 * This assumes the pixel buffer is in RGBA, 8 bits per pixel format.
 *
 * @param inputBuffer  The input pixel buffer.
 * @param outputBuffer The output pixel buffer.
 * @param width        The image width in pixels.
 * @param height       The image height in pixels.
 * @param count        The number of GCD tasks the conversion should be split into.
 */
- (void)convertToBWConcurrentFromBuffer:(uint8_t *)inputBuffer toBuffer:(uint8_t *)outputBuffer width:(NSInteger)width height:(NSInteger)height count:(NSInteger)count
{
    dispatch_queue_t queue = dispatch_queue_create("com.domain.app", DISPATCH_QUEUE_CONCURRENT);
    NSInteger stride = height / count;

    dispatch_apply(height / stride, queue, ^(size_t idx) {

        size_t j = idx * stride;
        size_t j_stop = MIN(j + stride, height);

        for (NSInteger row = j; row < j_stop; row++) {

            for (NSInteger col = 0; col < width; col++) {

                NSUInteger offset = (col + row * width) * 4;
                uint8_t *rgba = inputBuffer + offset;

                uint8_t red   = rgba[0];
                uint8_t green = rgba[1];
                uint8_t blue  = rgba[2];
                uint8_t alpha = rgba[3];

                uint8_t gray = 0.2126 * red + 0.7152 * green + 0.0722 * blue;

                outputBuffer[offset]     = gray;
                outputBuffer[offset + 1] = gray;
                outputBuffer[offset + 2] = gray;
                outputBuffer[offset + 3] = alpha;
            }
        }
    });

}

void releaseData(void *info, const void *data, size_t size)
{
    free((void *)data);
}

On an iPhone 5, converting a 7360 × 4912 image took 2.24 seconds with the simple, serial method, and 1.18 seconds when I used dispatch_apply split into two parts. When I tried splitting it into 4 or 8 dispatch_apply parts, I saw no further performance gain, which is consistent with the iPhone 5's dual-core CPU.
