RenderScript speedup 10x when forcing default CPU implementation


Problem description


I have implemented a CNN in RenderScript, as described in a previous question that led to this one. Basically, when running

    adb shell setprop debug.rs.default-CPU-driver 1
    

there is a 10x speedup on both the Nvidia Shield and the Nexus 7. The average computation time drops from around 50ms to 5ms, and the test app goes from around 50fps to 130fps or more. There are two convolution algorithms:

(1) moving kernel
(2) im2col and GEMM from ScriptIntrinsicBLAS

Both experience a similar speedup. The question is: why is this happening, and can this effect be triggered from code in a predictable way? Is detailed information about this available anywhere?

Edit:

As suggested below, I verified the use of finish() and copyTo(); here is a breakdown of the procedure. The reported speedup is measured AFTER the call to copyTo() but without finish(). Uncommenting finish() adds about 1ms to the time.

    double forwardTime = 0;
    long t = System.currentTimeMillis();
    //double t = SystemClock.elapsedRealtime(); // makes no difference
    for (Layer a : layers) {
        blob = a.forward(blob);
    }
    mRS.finish();   // adds about 1ms to measured time

    blob.copyTo(outbuf);
    forwardTime = System.currentTimeMillis() - t;
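
To see where the measured time actually goes, the three steps in the snippet above (the layer loop, finish(), and copyTo()) could be timed separately rather than as one interval. The following is a minimal sketch in plain Java, not part of the original post: `PhaseTimer` and `pause` are hypothetical names, and the sleeps are stand-ins so that the example runs without Android. It uses System.nanoTime(), which is intended for elapsed-time measurement, whereas System.currentTimeMillis() is a wall-clock value with coarser resolution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of RenderScript): accumulates elapsed time per named phase. */
final class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs body and adds its elapsed time to the total for the given phase. */
    void time(String phase, Runnable body) {
        long start = System.nanoTime();
        body.run();
        nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
    }

    /** Accumulated milliseconds for one phase, or 0 if that phase never ran. */
    double millis(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1e6;
    }

    /** Accumulated milliseconds over all phases. */
    double totalMillis() {
        long sum = 0;
        for (long n : nanosByPhase.values()) {
            sum += n;
        }
        return sum / 1e6;
    }

    /** Placeholder workload so the sketch runs on a desktop JVM. */
    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Assumption: on a device these three bodies would be the layer loop,
        // mRS.finish(), and blob.copyTo(outbuf) from the question.
        timer.time("forward", () -> pause(5));
        timer.time("finish", () -> pause(1));
        timer.time("copyTo", () -> pause(2));
        System.out.printf("forward=%.2f ms, finish=%.2f ms, copyTo=%.2f ms, total=%.2f ms%n",
                timer.millis("forward"), timer.millis("finish"),
                timer.millis("copyTo"), timer.totalMillis());
    }
}
```

If the forward phase turns out to be very short while finish() or copyTo() absorbs most of the time, that would suggest the kernels are still running when the loop returns.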
    

Maybe this is unrelated, but on the NVIDIA Shield I get an error message at startup, which disappears when running with adb shell setprop debug.rs.default-CPU-driver 1:

    E/Renderscript: rsAssert failed: 0, in vendor/nvidia/tegra/compute/rs/driver/nv/rsdNvBcc.cpp
    

I'm currently setting compileSdkVersion, minSdkVersion and targetSdkVersion to 23, with buildToolsVersion "23.0.2". The tablets are auto-updated to the very latest Android version. I'm not sure what minimum target I can set and still have ScriptIntrinsicBLAS available.

I'm using #pragma rs_fp_relaxed in all scripts. The Allocations all use default flags.

This question describes a similar situation, but it turned out the OP was creating new Script objects in every computational round. I do nothing of the sort; all Scripts and Allocations are created at init time.

Solution

The original post has the mRS.finish() call commented out. I am wondering whether that is the case here.

To benchmark RenderScript properly, we should wait for pending asynchronous operations to complete. There are generally two ways to do that:

1. Use RenderScript.finish(). This works well when using debug.rs.default-CPU-driver 1, and it also works with most GPU drivers. However, certain GPU drivers treat it as a no-op.
2. Use Allocation.copyTo() or a similar API to access the data of an Allocation, preferably the final output Allocation. This is really a trick, but it works on all devices. Just be aware that the copyTo operation itself may take some time, so make sure you take that into account.

The 5ms figure seems suspicious, although it might be real depending on the actual algorithm. It is worth double-checking whether it still holds when you add finish() or copyTo().
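
The effect of stopping the clock too early can be illustrated without any RenderScript code. The sketch below is a simulation under stated assumptions: a single-threaded executor stands in for an asynchronous driver, a sleep stands in for a kernel, and `AsyncTimingDemo` and `measure` are hypothetical names. It does not show how any real driver is implemented. It only shows that timing the submission alone reports far less than timing submission plus a blocking wait, which is the role finish() or copyTo() plays in a benchmark.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Simulation only: a worker thread stands in for an asynchronous compute driver. */
final class AsyncTimingDemo {

    /** Returns {msWithoutWait, msWithWait} for one simulated kernel of about workMs. */
    static double[] measure(long workMs) {
        ExecutorService driver = Executors.newSingleThreadExecutor();
        try {
            Runnable kernel = () -> {
                try {
                    Thread.sleep(workMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };

            // Misleading: the clock stops as soon as the work has been queued.
            long t0 = System.nanoTime();
            Future<?> pending = driver.submit(kernel);
            double withoutWait = (System.nanoTime() - t0) / 1e6;
            pending.get(); // drain the queue before the second measurement

            // Meaningful: the clock stops only after the queued work has completed.
            long t1 = System.nanoTime();
            driver.submit(kernel).get();
            double withWait = (System.nanoTime() - t1) / 1e6;

            return new double[] { withoutWait, withWait };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            driver.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] ms = measure(50);
        System.out.printf("without wait: %.2f ms, with wait: %.2f ms%n", ms[0], ms[1]);
    }
}
```

In a real benchmark, the blocking `get()` would correspond to finish() or to reading the final output Allocation, and a few warm-up iterations before measuring would help keep one-time initialization out of the reported number.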
