Calculate the sum of values in an array using RenderScript


Question


Hi, I am a newbie trying to code in RenderScript. I would like to know how I can perform a sum of elements in an array using RenderScript. Is there a way I can pass the output back into the script for sequential addition? My problem statement is: Vector Sum


Description: Calculate the sum of values in an array.

Input: Array of integers

Output: Integer


Any help would be much appreciated!

Answer


I'm afraid this is a bit more complex than it seems, but I'll do my best to explain here a possible route that you can take to implement this.


What you are asking for is better known as the parallel reduction algorithm, which can implement either an array sum as in your case, or any other commutative + associative operator which, when applied iteratively over an array, will "reduce" it to a single number. Other examples are finding the maximum or minimum values of a large array. In CUDA and OpenCL, there is a well known pattern for this computation that is capable of making the best possible use of parallel threads, and if you google "CUDA reduction" for example, you'll get tons of useful info on this algorithm.


The way this is implemented is by repeatedly reducing the array in half, over and over again, until you end up with a single element. Each time you reduce it, each new element is the sum of two previous elements.

[Diagram: a reduction tree, where each level sums adjacent pairs and halves the array.]


So for example, you start with a 16-element array. You run the algorithm once, and you end up with an 8-element array -- where each of these 8 elements is the sum of two numbers from the original array.


You run it again, and you end up with 4 elements -- where each of these is the sum of two numbers from the previous step. And so on...


You keep doing this, until you end up with only one number -- your total sum.
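Before jumping into RenderScript, the halving scheme above can be sketched as plain sequential Java; the class and method names here are illustrative, not part of the original answer. Each pass adds element `x + stride` into element `x`, so after log2(n) passes the total ends up in `data[0]` (assuming a power-of-two length):

```java
// Plain-Java sketch of the halving reduction described above.
// Assumes data.length is a power of two.
public class ReduceSketch {
    static int reduceSum(int[] data) {
        for (int stride = data.length / 2; stride > 0; stride /= 2) {
            for (int x = 0; x < stride; x++) {
                data[x] += data[x + stride];  // pair up: x and x + stride
            }
        }
        return data[0];  // the surviving element holds the total
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(reduceSum(a));  // prints 36
    }
}
```

The inner loop is exactly what the GPU version parallelizes: all `stride` additions of a pass are independent, so they can run as one kernel launch per pass.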

Here is how this would be implemented in RenderScript, in an inefficient way:

Java:

int[] ints; // Your data is held here.

Allocation data = Allocation.createSized(rs, Element.I32(rs), ints.length, Allocation.USAGE_SCRIPT);
data.copy1DRangeFrom(0, ints.length, ints);

ScriptC_Reduce script = new ScriptC_Reduce(rs);
script.bind_data(data);

// Drive the kernel with the same allocation; the kernel reads and
// writes through the bound data pointer, so in/out are otherwise unused.
for (int stride = ints.length / 2; stride > 0; stride /= 2) {
    script.set_stride(stride);
    script.forEach_root(data, data);
}

data.copyTo(ints);
int totalsum = ints[0];

RenderScript:


#pragma version(1)
#pragma rs java_package_name(...[your package here]...)

int stride;
int *data;  // bound from Java; the kernel works on this directly

void root(const int32_t *v_in, int32_t *v_out, uint32_t x) {
    // Only the first `stride` threads do any work on this pass.
    if (x < (uint32_t)stride) data[x] += data[x + stride];
}


If you've worked with RS before, you may notice a couple strange things:


  1. Notice that in the RS kernel, v_in and v_out are not used at all, since they are restricted to reading and writing the data element at the current thread index, whereas the reduction algorithm needs to access data elements at other positions. Hence the int array pointer `data`, which is bound from Java to the allocation of the same name, and is what the kernel works on directly.

  2. The kernel is invoked multiple times from a loop in Java, rather than looping inside the kernel. This is because at each iteration, all the data from the previous step must already be in its expected positions; otherwise data[x + stride] would be out of sync. In RS, a kernel invocation blocks, meaning nothing else executes until the kernel has finished processing the entire data. This is similar to what __syncthreads() does inside a CUDA kernel, if you're familiar with it.


I mentioned above, however, that this is an inefficient implementation. But it should point you in the right direction. To make it more efficient, you might need to split the data into smaller chunks to be computed separately, because as given here, this algorithm would launch ints.length threads at each iteration step, and on very large arrays that will result in a lot of steps, and a lot of idle threads at each step.


Furthermore, this assumes the length of your array is exactly a power of 2, so that repeated halvings result in exactly one element. For arrays of other sizes, you may need to 0-pad your array. And here again, when working on very large arrays, 0-padding will waste a lot of memory.
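The 0-padding itself is cheap to express; here is a hypothetical helper (the name `zeroPad` is mine, not from the answer). Padding with 0 is safe because 0 is the identity element for addition:

```java
import java.util.Arrays;

// Grow an array to the next power of two so that repeated halving
// lands on exactly one element. The extra slots default to 0.
public class PadToPow2 {
    static int[] zeroPad(int[] data) {
        int n = 1;
        while (n < data.length) n *= 2;  // smallest power of two >= length
        return Arrays.copyOf(data, n);   // new slots are zero-initialized
    }

    public static void main(String[] args) {
        int[] padded = zeroPad(new int[]{1, 2, 3, 4, 5});
        System.out.println(padded.length);  // prints 8
    }
}
```

The memory cost the answer warns about is visible here: in the worst case (length just over a power of two) the padded array is nearly twice the original size.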


So to fix these issues, you may want to split your array into multiple chunks of, say, 64 elements each. Therefore, if you don't have an exact array length, padding the "last" chunk up to 64 will not require that much memory. Also, you will need fewer iteration steps (and fewer idle threads) to reduce 64 elements. Of course, 64 is a magic number I just made up. Try other powers of 2 to see their results, you might see better results with other chunk sizes such as 16 or 32. I suspect performance vs. chunk size will be very hardware-dependent.
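The chunked scheme can be sketched sequentially as a two-stage sum; this is my own illustrative sketch, not code from the answer, and on the GPU each chunk would instead be reduced by its own group of threads:

```java
// Two-stage reduction: first reduce each fixed-size chunk to a
// partial sum, then sum the (much smaller) array of partials.
// CHUNK = 64 is the answer's made-up number; tune per hardware.
public class ChunkedReduce {
    static final int CHUNK = 64;

    static int chunkedSum(int[] data) {
        int numChunks = (data.length + CHUNK - 1) / CHUNK;  // round up
        int[] partials = new int[numChunks];
        for (int c = 0; c < numChunks; c++) {
            int end = Math.min((c + 1) * CHUNK, data.length);
            for (int i = c * CHUNK; i < end; i++) {
                partials[c] += data[i];  // stage 1: per-chunk sums
            }
        }
        int total = 0;
        for (int p : partials) total += p;  // stage 2: sum the partials
        return total;
    }

    public static void main(String[] args) {
        int[] a = new int[200];
        java.util.Arrays.fill(a, 1);
        System.out.println(chunkedSum(a));  // prints 200
    }
}
```

Note that only the last chunk can be short, so any 0-padding is bounded by one chunk (at most 63 wasted elements here) instead of nearly doubling the whole array.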

EDIT: This assumes that RenderScript can make use of a GPU driver for the device it's running on, so that it can actually launch a larger number of parallel threads. Otherwise, a CPU-only execution of a kernel like this would probably be even slower than processing the array linearly.
