Why do Google's image processing sample Renderscript kernels run slower on the GPU on the Nexus 5?


Question

I'd like to thank Stephen for the very quick reply to my previous post. This is a follow-up to that post: Why very simple Renderscript runs 3 times slower in GPU than in CPU (http://stackoverflow.com/questions/20381691/why-very-simple-renderscript-runs-3-times-slower-in-gpu-than-in-cpu)

My development platform is as follows:

Development OS: Windows 7 32-bit
Phone: Nexus 5
Phone OS version: Android 4.4
SDK bundle: adt-bundle-windows-x86-20131030
Build-tool version: 19
SDK tool version: 22.3
Platform tool version: 19

To evaluate the performance of Renderscript GPU compute, and to learn the general tricks for making code faster with Renderscript, I ran the following test.

I checked out the code from Google's Android Open Source Project, using the tag android-4.2.2_r1.2. I used this tag simply because the ImageProcessing test sample is not available in newer versions.

Then I used the project under "base\tests\RenderScriptTests\ImageProcessing" for the test. I recorded the running time of each kernel on the GPU as well as on the CPU; the results are listed below.

Kernel                         GPU        CPU
Levels Vec3 Relaxed            7.45ms     14.89ms
Levels Vec4 Relaxed            6.04ms     12.85ms
Levels Vec3 Full               N/A        28.97ms
Levels Vec4 Full               N/A        35.65ms
Blur radius 25                 203.2ms    245.60ms
Greyscale                      7.16ms     11.54ms
Grain                          33.33ms    21.73ms
Fisheye Full                   N/A        51.55ms
Fisheye Relaxed                92.90ms    45.34ms
Fisheye Approx Full            N/A        51.65ms
Fisheye Approx Relaxed         93.09ms    39.11ms
Vignette Full                  N/A        44.17ms
Vignette Relaxed               8.02ms     46.68ms
Vignette Approx Full           N/A        45.04ms
Vignette Approx Relaxed        8.20ms     43.69ms
Convolve 3x3                   37.66ms    16.81ms
Convolve 3x3 Intrinsics        N/A        4.57ms
ColorMatrix                    5.87ms     8.26ms
ColorMatrix Intrinsics         N/A        2.70ms
ColorMatrix Intrinsics Grey    N/A        2.52ms
Copy                           5.59ms     2.40ms
CrossProcess (using LUT)       N/A        5.74ms
Convolve 5x5                   84.25ms    46.59ms
Convolve 5x5 Intrinsics        N/A        9.69ms
Mandelbrot                     N/A        50.2ms
Blend Intrinsics               N/A        21.80ms

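The N/A entries in the table are there because neither full-precision kernels nor RS intrinsics run on the GPU. Of the 13 algorithms that do run on the GPU, 6 run slower there than on the CPU. Since this code was written by Google, I consider this worth investigating. At the very least, the assumption that "code runs faster on the GPU", which I had taken from Renderscript, does not hold here.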
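The "6 of 13" figure can be checked directly against the table. The following is a small plain C sketch, not part of the original post, that copies the 13 rows having both a GPU and a CPU time and counts those where the GPU time is larger. The struct and function names are made up for illustration.

```c
#include <stddef.h>

/* One benchmark row: only kernels that ran on both GPU and CPU are listed. */
struct timing { const char *name; double gpu_ms; double cpu_ms; };

static const struct timing TIMINGS[] = {
    {"Levels Vec3 Relaxed",      7.45,  14.89},
    {"Levels Vec4 Relaxed",      6.04,  12.85},
    {"Blur radius 25",         203.2,  245.60},
    {"Greyscale",                7.16,  11.54},
    {"Grain",                   33.33,  21.73},
    {"Fisheye Relaxed",         92.90,  45.34},
    {"Fisheye Approx Relaxed",  93.09,  39.11},
    {"Vignette Relaxed",         8.02,  46.68},
    {"Vignette Approx Relaxed",  8.20,  43.69},
    {"Convolve 3x3",            37.66,  16.81},
    {"ColorMatrix",              5.87,   8.26},
    {"Copy",                     5.59,   2.40},
    {"Convolve 5x5",            84.25,  46.59},
};
static const size_t N_TIMINGS = sizeof TIMINGS / sizeof TIMINGS[0];

/* Count the rows where the GPU run took longer than the CPU run. */
static size_t count_gpu_slower(void) {
    size_t n = 0;
    for (size_t i = 0; i < N_TIMINGS; i++)
        if (TIMINGS[i].gpu_ms > TIMINGS[i].cpu_ms)
            n++;
    return n;
}
```

The slower rows are Grain, both Fisheye variants, Copy, and the two convolutions.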

I investigated some of the algorithms in the list, and I'd like to mention two.

In Vignette, the performance on the GPU is much better. I found this is due to calls to several functions from rs_cl.rsh. If I comment out those calls, the CPU runs faster (see my previous question, linked at the top, for an extreme case). So the question is why this happens. Most of the functions in rs_cl.rsh are math related, e.g. exp, log, cos. Why do such functions run so much faster on the GPU? Is it because their implementation is actually highly parallel, or simply because the version that runs on the GPU is better implemented than the version that runs on the CPU?

Another example is conv3x3 and conv5x5. Although there are cleverer implementations than Google's version in this test app, I think Google's implementation is certainly not bad. It tries to minimize the number of additions and uses some convenience functions from rs_cl.rsh such as convert_float4(). So at first glance I assumed it would run faster on the GPU. However, it runs a lot slower (on both the Nexus 4 and the Nexus 5, which both use Qualcomm GPUs). I think this example is very representative, because the algorithm needs to access the pixels around the current pixel, and that kind of access is common in many image processing algorithms. If something like 2D convolution can't be made faster on the GPU, I suspect many other algorithms will suffer in the same way. I would highly appreciate it if you could identify where the problem is and suggest some ways to make such algorithms faster.
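To make the neighbour-access pattern concrete, here is a minimal plain C sketch of a 3x3 convolution. It is not the AOSP Renderscript source: it works on a single-channel float image, and the function names and the clamp-to-edge border handling are assumptions made for illustration. The point is only that every output pixel reads nine input pixels around it.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index into [0, n-1] so border pixels reuse the nearest edge value. */
static size_t clamp_index(long i, size_t n) {
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* 3x3 convolution of a row-major, single-channel float image.
   k holds the nine coefficients in row-major order (top-left first). */
void convolve3x3(const float *src, float *dst, size_t w, size_t h,
                 const float k[9]) {
    assert(src && dst && src != dst && w > 0 && h > 0);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            float acc = 0.0f;
            /* Nine reads per output pixel: the 3x3 neighbourhood of (x, y). */
            for (int dy = -1; dy <= 1; dy++) {
                size_t yy = clamp_index((long)y + dy, h);
                for (int dx = -1; dx <= 1; dx++) {
                    size_t xx = clamp_index((long)x + dx, w);
                    acc += src[yy * w + xx] * k[(dy + 1) * 3 + (dx + 1)];
                }
            }
            dst[y * w + x] = acc;
        }
    }
}
```

With the identity kernel (only the centre coefficient set to 1) the output equals the input, which is a quick sanity check for the indexing.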

The more general question is this: given the test results I showed, what guidelines should people follow to get higher performance and avoid performance degradation as much as possible? After all, performance is the second most important goal of Renderscript, and I think the portability of RS is already quite good.

Thank you!

Recommended answer

There are really two answers to this question.

1: Don't believe the hype regarding GPUs. For some workloads they are faster. However, for many workloads the difference is small or negative. You have at least 2 different processor types; don't worry about which one gets used, only worry about whether the performance is what you want.

2: For performance tuning I would really focus on the algorithm and on avoiding slow operations. Examples:

  • Prefer float to double when float provides adequate precision.

  • Use RS_FP_RELAXED when you don't need IEEE-754 compliance.

  • Prefer multiplication to division.

  • Use native_* (e.g. native_powr) in place of the full-precision routines where the precision is adequate.

  • Use rsGetElementAt_* over rsSample or rsGetElementAt. The typed versions of get are faster than the general get, and in many cases much faster than rsSample.

  • Loads from script globals are typically faster than loads from an rs_allocation. Prefer globals for kernel constants.

3: There are some performance issues with global loads today on the Nexus (4, 5, 7v2) GPU path. These will be improved with updates.
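Some of the tuning advice in point 2 carries over to any C-like code. The sketch below is plain C, not Renderscript, and only illustrates three of the bullets: float instead of double, multiplication by a precomputed reciprocal instead of a per-pixel division, and a file-scope constant standing in for a script global. The function name and the gain parameter are hypothetical. The Renderscript-specific items (RS_FP_RELAXED, native_*, rsGetElementAt_*) are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* File-scope constant, standing in for a script global:
   the reciprocal is computed once rather than per pixel. */
static const float INV_255 = 1.0f / 255.0f;

/* Convert 8-bit samples to floats normalised to [0, 1], scaled by gain.
   All arithmetic is float, and the loop multiplies instead of dividing. */
void scale_samples(const uint8_t *src, float *dst, size_t n, float gain) {
    assert(src && dst);
    const float scale = gain * INV_255;  /* hoisted out of the loop */
    for (size_t i = 0; i < n; i++)
        dst[i] = (float)src[i] * scale;
}
```

Multiplying by a reciprocal can differ from a true division in the last bit of the result, which is the kind of trade-off that relaxed precision accepts.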
