Can/Should I run this code on a GPU?


Problem Description

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.

Several methods perform different, but independent, calculations on the array in nested loops, for example:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;

    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }

    noOfNumbers.Add(x, noOfOccurrences);
}

The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:

  • Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
  • If yes: does anyone know any tutorial or have any example code (the programming language doesn't matter)?

Any help would be greatly appreciated.

Recommended Answer

UPDATE: GPU version

__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary, int size, int num_blocks)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x; // Each thread of each block will
    float y;                                       // compute one (or more) buckets
    int noOfOccurrences = 0;
    int a;

    while (x < size)             // While there is work to do, each thread will:
    {
        dictionary[x] = 0;       // Initialize the dictionary position it works on
        noOfOccurrences = 0;

        for (int j = 0; j < largeFloatingPointArraySize; j++) // Search for floats
        {                                                     // that map to this
                                                              // bucket
            y = largeFloatingPointArray[j]; // Take a candidate from the float array
            y *= 10000;                     // e.g. if y = 0.0001f then y = 1.0f
            a = y + 0.5;                    // round to nearest int: a = 1
            if (a == x) noOfOccurrences++;
        }

        dictionary[x] += noOfOccurrences; // Record in the dictionary how many
                                          // times this "float" appears

        x += blockDim.x * gridDim.x;      // Move on to the next position this
    }                                     // thread will work on
}
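
For context, a minimal host-side launch sketch for this kernel could look like the following (hostData, arraySize and hostDict are illustrative names, not from the original; error checking omitted; the kernel zeroes each dictionary slot itself, so no memset is needed):

int dictionarySize = 1000000; // 0f .. 100f in steps of 0.0001f
float *d_data; int *d_dict;
cudaMalloc(&d_data, arraySize * sizeof(float));
cudaMalloc(&d_dict, dictionarySize * sizeof(int));
cudaMemcpy(d_data, hostData, arraySize * sizeof(float), cudaMemcpyHostToDevice);

int threads = 256, blocks = 128; // tune for your GPU
hash<<<blocks, threads>>>(d_data, arraySize, d_dict, dictionarySize, blocks);

cudaMemcpy(hostDict, d_dict, dictionarySize * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_data); cudaFree(d_dict);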

I have only tested this with smaller inputs, because I am testing on my laptop. Nevertheless, it did work. However, further testing is needed.

UPDATE: Sequential version

I just wrote this naive version, which performs your algorithm for 30,000,000 values in less than 20 seconds (including the time spent in the function that generates the data).

Basically, it sorts your array of floats. It then walks over the sorted array, counting how many times each value appears consecutively, and puts the value into a dictionary along with its count.

You could use a sorted map (std::map) instead of the unordered_map that I used.
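
For instance, changing only the typedef (and the matching signatures) would give you the keys in sorted order:

#include <map>

typedef std::map<float, int> Mymap; // ordered by key, unlike unordered_map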

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>


typedef std::tr1::unordered_map<float, int> Mymap;


// Fill the array with random floats in [0, 100)
void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;

    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{
    for(long int i = 0; i < size; i++)
        printf("%f\n", data[i]);
}

// Walk over the sorted array and count consecutive occurrences of each value
Mymap fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    Mymap dict;

    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
            dict.insert(Mymap::value_type(previous, count));
            previous = data[i];
            count = 1;
        }
    }
    dict.insert(Mymap::value_type(previous, count)); // add the last member
    return dict;
}

void printMAP(const Mymap &dict)
{
    for(Mymap::const_iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(float): " << i->first << ", value(int): " << i->second << std::endl;
    }
}


int main(int argc, char** argv)
{
    int size = 1000000;
    if(argc > 1) size = atoi(argv[1]);
    printf("Size = %d\n", size);

    // Allocate on the heap: 30 million floats would overflow the stack
    float *data = new float[size];

    Mymap dict;

    generator(data, size);

    std::sort(data, data + size);
    dict = fill_dict(data, size);

    delete[] data;
    return 0;
}
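
To build and run the sequential version, something along these lines should work (the file name is illustrative): g++ -O2 count.cpp -o count && ./count 30000000. If you use the Thrust variant below, compile with nvcc instead.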

If you have the Thrust library installed on your machine, you should use this:

#include <thrust/sort.h>
thrust::sort(data, data + size);

instead of this:

sort(data, data + size);

It will almost certainly be faster.
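
One caveat worth hedging: called on raw host pointers as above, thrust::sort executes on the host. To run the sort on the GPU itself, you would typically stage the data through a thrust::device_vector, along these lines (a sketch, not part of the original answer):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

thrust::device_vector<float> d_data(data, data + size); // copy the data to the GPU
thrust::sort(d_data.begin(), d_data.end());             // sort on the device
thrust::copy(d_data.begin(), d_data.end(), data);       // copy the result back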

Original Post

"I'm working on a statistical application which has a large array containing 10 - 30 millions of floating point values."

"Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?"

Yes, it is. A month ago I put a Molecular Dynamics simulation entirely on the GPU. One of the kernels, which calculates the force between pairs of particles, receives 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).

So you are planning to put 30 million floats on the GPU; at 4 bytes each, that is 30,000,000 × 4 B ≈ 114 MB of global memory, so this is not a problem; even my laptop has 250 MB.

Could the number of calculations be an issue in your case? Based on my experience with Molecular Dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours, and based on your code example it looks lighter than the MD simulation.

Here is the force calculation example:

__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z, ...)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;

    ...

    while (pos < particles)
    {
        for (int i = 0; i < particles; i++)
        {
            if (/* inside of the same radius */)
            {
                // calculate force
            }
        }
        pos += blockDim.x * gridDim.x;
    }
}

A simple example of CUDA code could be the sum of two arrays:

In C:

for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i]; 

In CUDA:

__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for (; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}

In CUDA, you basically take each iteration of the for loop and hand it to a different thread, using:

1) threadIdx.x + blockIdx.x*blockDim.x;

Each block has an ID from 0 to N-1 (N being the maximum number of blocks), and each block has X threads, with IDs from 0 to X-1.

Expression 1) gives you the loop iteration that each thread will compute, based on its own ID and the ID of the block the thread is in; blockDim.x is the number of threads per block.

So if you have 2 blocks, each with 10 threads, and N = 40, then:

Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
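
For completeness, here is a minimal host-side sketch for launching the add kernel above (a hedged example: it assumes host arrays a, b and c of length N already exist, and omits error checking):

int N = 40;
int *d_a, *d_b, *d_c;
cudaMalloc(&d_a, N * sizeof(int));
cudaMalloc(&d_b, N * sizeof(int));
cudaMalloc(&d_c, N * sizeof(int));
cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

add<<<2, 10>>>(d_c, d_a, d_b, N); // 2 blocks of 10 threads, as in the example

cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);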

Looking at your code, I made this draft of what it could look like in CUDA:

__global__ void hash(float *largeFloatingPointArray,
                     int largeFloatingPointArraySize,
                     int *dictionary)
{
    // You can turn the dictionary into one array of int:
    // here each position will represent one float.
    // Since x = 0f; x < 100f; x += 0.0001f,
    // you can associate each x with a different position
    // in the dictionary:

    // pos 0 has the same meaning as 0f;
    // pos 1 means float 0.0001f
    // pos 2 means float 0.0002f etc.
    // Then you use the int at each position
    // to count how many times that "float" has appeared.

    int x = blockIdx.x;  // Each block will take a different x to work on
    float y;

    while (x < 1000000) // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // This function converts x to the
                                              // float it represents (x * 0.0001)

        // Each thread of the block takes its share of the ys
        // from largeFloatingPointArray

        for (int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // Every thread adds its private count; atomicAdd makes the
        // concurrent updates to dictionary[x] safe
        atomicAdd(&dictionary[x], noOfOccurrences);

        x += gridDim.x; // Move on to the next x this block is responsible for
    }
}

You have to use atomicAdd because all the threads of a block accumulate into the same dictionary[x] entry and may write to it at the same time, so you have to ensure mutual exclusion. (The dictionary array must also be zeroed, e.g. with cudaMemset, before the kernel is launched.)

This is only one approach; you could even give the iterations of the outer loop to the threads instead of the blocks.
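
The converting_int_to_float call in the draft above is a placeholder; a minimal sketch of what it could be, assuming the position-to-float mapping described in the comments:

__device__ float converting_int_to_float(int x)
{
    return x * 0.0001f; // pos 0 -> 0.0f, pos 1 -> 0.0001f, pos 2 -> 0.0002f, ...
}

Note that, as in the question's original code, the exact z == y comparison only matches values that are exactly representable as floats; the updated kernel at the top of this answer sidesteps this by rounding each float to an integer bucket instead.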

Tutorials

The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.

And some others:

  • Developing With CUDA - Introduction
  • Volume I: Introduction to CUDA Programming
  • Getting started with CUDA
  • CUDA Resources List

Take a look at the last item; you will find many links for learning CUDA.

OpenCL: OpenCL Tutorial | MacResearch
