Is it possible to use CUDA in order to compute the frequency of elements inside a sorted array efficiently?

Question

I'm very new to CUDA. I've read a few chapters from books and a lot of tutorials online, and I have made my own implementations of vector addition and multiplication.

I would like to take it a little further, so let's say we want to implement a function that takes as input a sorted array of integers.

Our goal is to find the frequency of each integer in the array.

Sequentially, we could scan the array once to produce the output; the time complexity would be O(n). (A minimal version of this scan is sketched after the example array below.)

Since the groups are different, I guess it must be possible to take advantage of CUDA.

Suppose this is the array

   1
   1
   1
   1
   2
   2
   3
   3
   5
   5
   6
   7
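
For concreteness, the sequential single-pass scan mentioned above might look like the following minimal sketch (hypothetical code, not from the original question); it relies on equal values being adjacent in a sorted array:

   // One pass over the sorted array, counting run lengths.
   #include <cstdio>

   int main() {
       int data[] = {1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 6, 7};
       int n = sizeof(data) / sizeof(data[0]);

       for (int i = 0; i < n; ) {
           int j = i;
           while (j < n && data[j] == data[i]) ++j;  // find end of current run
           printf("value=%d, total=%d\n", data[i], j - i);
           i = j;                                    // jump to next distinct value
       }
       return 0;
   }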

In order to achieve full parallelism, each thread would have to know exactly which part of the array it has to scan to compute its counts. This could be achieved with an extra array, say int dataPosPerThread[], where dataPosPerThread[threadId] holds the starting position in the input array for thread threadId; each thread would then know where to start and where to finish.

However, in this way we won't gain anything, because finding those positions would itself take O(n) time. Eventually the total cost would be O(n) + cost_to_transfer_the_data_to_the_gpu + O(c) + cost_to_transfer_the_results_from_the_gpu, where O(c) is the constant time it would take the threads to produce the final output, assuming of course that the initial array contains many different integers.

I would like to avoid the extra O(n) cost.

What I've thought of so far is: given an array of size arraySize, we specify the total number of threads to use, say totalAmountOfThreads, which means each thread will have to scan arraySize/totalAmountOfThreads values.

The first thread (id 0) would scan from position 0 up to position arraySize/totalAmountOfThreads - 1.

The second thread would start from position arraySize/totalAmountOfThreads, and so on.

The problem, though, is that some threads might span different integer groups, or work on one group whose values are also being processed by other threads. For instance, in the above example, if we suppose we have 6 threads, each thread will take 2 integers of the array, so we will have something like this:

   1     <-------- thread 0
   1
   1     <-------- thread 1
   1
   2     <-------- thread 2
   2
   3     <-------- thread 3
   3
   5     <-------- thread 4
   5
   6     <-------- thread 5
   7

As you can see, thread 0 has only 1 values, yet there are other 1 values being processed by thread 1. In order to achieve parallelism, though, these threads have to be working on unrelated data. Assuming we use this logic, each thread will compute the following results (a hypothetical kernel sketch follows the list):

   thread 0 => {value=1, total=2}
   thread 1 => {value=1, total=2}
   thread 2 => {value=2, total=2}
   thread 3 => {value=3, total=2}
   thread 4 => {value=5, total=2}
   thread 5 => {{value=6, total=1}, {value=7, total=1}}
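
A hypothetical sketch of this per-chunk counting (the name countChunks, the chunk parameter, and the output layout are all illustrative, not from the original post): each thread records the run lengths it sees inside its own chunk, writing only into its own slots so no synchronization is needed, and merging the partial results is exactly the open question:

   // Each thread counts runs inside its own chunk of the sorted data.
   // values/totals have one slot per input element; totals must be
   // zero-initialized so unused slots can be recognized afterwards.
   __global__ void countChunks(const int *data, int n, int chunk,
                               int *values, int *totals)
   {
       int tid = blockIdx.x * blockDim.x + threadIdx.x;
       int begin = tid * chunk;
       if (begin >= n) return;
       int end = min(begin + chunk, n);

       int out = begin;                 // private output slots for this thread
       int v = data[begin];
       int count = 1;
       for (int i = begin + 1; i < end; ++i) {
           if (data[i] == v) {
               ++count;
           } else {                     // a run ended inside the chunk
               values[out] = v; totals[out] = count; ++out;
               v = data[i]; count = 1;
           }
       }
       values[out] = v; totals[out] = count;  // flush the last run
   }

Launched with chunk = arraySize/totalAmountOfThreads = 2 on the example array, this reproduces the per-thread pairs above, including the two pairs from thread 5.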

Having this result, what can be achieved further? Someone could suggest using an extra hash map, like unordered_map, which could efficiently update the total variable for each value computed by a single thread. However:

  1. unordered_map is not usable in CUDA device code

  2. This would mean that the threads could not take advantage of shared memory, because two threads from different blocks could be working with the same values, so the hash map would have to live in global memory.

  3. Even if the above two weren't a problem, we would still have race conditions between threads when updating the hash map (the usual way around this is sketched below).
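
For reference, the common CUDA pattern for such concurrent counter updates is not a hash map but atomicAdd on a plain array of bins. This is a minimal sketch, assuming the values are known to fall in a small range [0, NUM_BINS); the bound NUM_BINS is an assumption, not something given in the question. Atomics resolve the race condition in point 3, though each update still touches global memory:

   // Sketch: one bin per possible value; bins must be zero-initialized.
   #define NUM_BINS 256  // assumed upper bound on the values

   __global__ void histogramAtomic(const int *data, int n, unsigned int *bins)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           atomicAdd(&bins[data[i]], 1u);  // serialized only on collisions
   }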

What would be a good way to approach this problem?

Thank you in advance

Solution

As @tera has already pointed out, what you're describing is a histogram.

You may be interested in the thrust histogram sample code. If you refer to the dense_histogram() routine as an example, you'll note that the first step is to sort the data.

So, yes, the fact that your data is sorted will save you a step.

In a nutshell we are:

  1. sorting the data
  2. marking the boundaries of different elements within the data
  3. computing the distance between the boundaries.

As shown in the sample code, thrust can do each of the above steps with a single function call. Since your data is already sorted, you can effectively skip the first step.
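
Adapted from that thrust example, a minimal sketch of steps 2 and 3 on already-sorted data might look like this (deriving the bin count from data.back() assumes the values are small non-negative integers; this is an illustration, not the exact sample code):

   #include <thrust/device_vector.h>
   #include <thrust/binary_search.h>
   #include <thrust/adjacent_difference.h>
   #include <thrust/iterator/counting_iterator.h>
   #include <iostream>

   int main() {
       // Already-sorted input, so step 1 (sorting) is skipped.
       int h_data[] = {1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 6, 7};
       thrust::device_vector<int> data(h_data, h_data + 12);

       // One bin per possible value in [0, max].
       int num_bins = data.back() + 1;
       thrust::device_vector<int> histogram(num_bins);

       // Step 2: a vectorized binary search finds, for each value,
       // where its run ends in the sorted data (a cumulative count).
       thrust::counting_iterator<int> search(0);
       thrust::upper_bound(data.begin(), data.end(),
                           search, search + num_bins,
                           histogram.begin());

       // Step 3: differences between consecutive boundaries are the counts.
       thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                   histogram.begin());

       for (int v = 0; v < num_bins; ++v)
           std::cout << v << ": " << histogram[v] << "\n";
       return 0;
   }

For the example array this prints a count of 4 for value 1, 2 for values 2, 3 and 5, and 1 for values 6 and 7. If the empty bins are unwanted, the same example's sparse variant (thrust::reduce_by_key over the sorted data with a constant iterator of ones) yields only the distinct values and their counts.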
