How to do correct timing of Android RenderScript code on Nvidia Shield


Problem description

I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they do not.

The CNN (LeNet) is implemented as 9 layers held in a queue, and the computation is performed in sequence. Each layer is timed individually.

Here is an example (all times in ms):

       conv1  pool1 conv2  pool2 resh1 ip1    relu1  ip2    softmax
nexus7 11.177 7.813 13.357 8.367 8.097 2.1    0.326  1.557  2.667
shield 13.219 1.024 1.567  1.081 0.988 14.588 13.323 14.318 40.347

The distribution of the times is about right for the Nexus, with conv1 and conv2 (the convolution layers) taking most of the time. On the Shield, however, the times drop far below what's reasonable for layers 2-4 and seem to pile up towards the end. The softmax layer is a relatively small job, so 40 ms is far too large. Either my timing method is faulty, or something else is going on.

The code running the layers looks something like this:

double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {

    double t = SystemClock.elapsedRealtime(); 
    //long t = System.currentTimeMillis(); // makes no difference

    blob = a.forward(blob); // here we call renderscript forEach_(), invoke_() etc

    //mRS.finish(); // makes no difference

    t = SystemClock.elapsedRealtime() - t; 
    //t = System.currentTimeMillis() - t; // makes no difference

    times[layerindex] += t; // later we take average etc

    layerindex++;
}

It is my understanding that once forEach_() returns, the job is supposed to be finished. In any case, mRS.finish() should provide a final barrier. But looking at the times, the only reasonable explanation is that the jobs are still being processed in the background.
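For reference, a barrier-based measurement would look roughly like the sketch below. This is only a sketch against the same mRS context and Layer objects as above, and it only helps if the driver honors finish() as a full barrier; note that the barrier is placed both before the timer starts (so work queued by the previous layer is not counted) and before it stops.

double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {

    mRS.finish();                       // drain anything still queued from the previous layer
    long start = SystemClock.elapsedRealtime();

    blob = a.forward(blob);             // enqueue this layer's forEach_()/invoke_() calls

    mRS.finish();                       // block until the queued kernels have actually run
    times[layerindex] += SystemClock.elapsedRealtime() - start;

    layerindex++;
}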

The app is very simple: I just run the test from MainActivity and print to logcat. Android Studio builds the app as a release and runs it on the device, which is connected via USB.

(1) What is the correct way to time RenderScript processes?

(2) Is it true that when forEach_() returns, the threads spawned by the script are guaranteed to be done?

(3) In my test app, I simply run directly from the MainActivity. Is this a problem (other than blocking the UI thread and making the app unresponsive)? If this influences the timing or causes the weirdness, what is a proper way to set up a test app like this?

Solution

I've implemented CNNs in RenderScript myself and, as you describe, it does require chaining multiple kernels and calling forEach_*() a number of times, for each layer, if each layer is implemented as a separate kernel. As such, I can assure you that the forEach call returning does not guarantee that the work has completed. In theory, the call only schedules the kernel, and all queued-up requests will actually run whenever the system determines it is best to, especially if they are processed on the tablet's GPU.

Usually, the only way to make absolutely sure a kernel has truly run is to explicitly read the output of the RS kernel in between layers, for example by calling .copyTo() on that kernel's output Allocation. This "forces" any queued-up RS jobs that have not run yet, and on which that layer's output allocation depends, to execute at that point. Granted, this introduces data-transfer overhead and your timing will not be fully accurate -- in fact, the execution time of the full network will almost certainly be lower than the sum of the individual layer times measured this way. But as far as I know, it is the only reliable way to time individual kernels in a chain, and it gives you the feedback you need to find bottlenecks and guide your optimization, if that's what you're after.
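As a rough illustration of this approach, the per-layer loop from the question could read the output allocation back before stopping the timer. This is only a sketch: getOutputAllocation() and the float[] scratch buffer are assumptions about how a layer exposes its (F32) output, not part of the original code, and android.renderscript.Allocation must be imported.

double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {

    long start = SystemClock.elapsedRealtime();

    blob = a.forward(blob);                          // enqueue this layer's kernels

    Allocation out = a.getOutputAllocation();        // hypothetical accessor for the layer's output
    float[] scratch = new float[out.getType().getCount()];
    out.copyTo(scratch);                             // forces every queued kernel this output depends on to run

    times[layerindex] += SystemClock.elapsedRealtime() - start;

    layerindex++;
}

As noted above, the copy-out adds overhead to every layer, so treat the resulting numbers as upper bounds on the pure kernel times rather than exact figures.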
