Can/Should I run this code of a statistical application on a GPU?


Question

I'm working on a statistical application containing approximately 10-30 million floating-point values in an array.

Several methods perform different, but independent, calculations on the array in nested loops, for example:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;

    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }
    noOfNumbers.Add(x, noOfOccurrences);
}

The current application is written in C#, runs on an Intel CPU, and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:

  • Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
  • If yes: does anyone know of any tutorial or sample code (the programming language doesn't matter)?

Answer

Update: GPU version

__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary, int size, int num_blocks)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x; // Each thread of each block will
    float y;                                       // compute one (or more) buckets
    int noOfOccurrences = 0;
    int a;

    while (x < size)             // While there is work to do, each thread will:
    {
        dictionary[x] = 0;       // initialize the dictionary position it works on
        noOfOccurrences = 0;

        for (int j = 0; j < largeFloatingPointArraySize; j++) // search for floats
        {                                                     // that map to this bucket
            y = largeFloatingPointArray[j]; // take a candidate from the float array
            y *= 10000;                     // e.g. if y = 0.0001f, y becomes 1.0
            a = y + 0.5;                    // round to the nearest integer bucket
            if (a == x) noOfOccurrences++;
        }

        dictionary[x] += noOfOccurrences; // update the dictionary with the number
                                          // of times this float appeared

        x += blockDim.x * gridDim.x;      // move to the next position this thread handles
    }
}
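
For completeness, here is a sketch of how the host side might launch this kernel. The buffer names, the 128-block/256-thread launch configuration, and the 1,000,000-bucket dictionary size (100f / 0.0001f) are my assumptions, not part of the original answer:

#include <cuda_runtime.h>

void count_occurrences(float *hostData, int n)
{
    const int dictSize = 1000000;   // buckets for 0.0000f .. 99.9999f (100 / 0.0001)

    float *devData;
    int   *devDict;
    cudaMalloc(&devData, n * sizeof(float));
    cudaMalloc(&devDict, dictSize * sizeof(int));
    cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = 128;              // enough blocks to keep the GPU busy; tune for your card
    hash<<<blocks, threads>>>(devData, n, devDict, dictSize, blocks);
    cudaDeviceSynchronize();

    int *hostDict = new int[dictSize];
    cudaMemcpy(hostDict, devDict, dictSize * sizeof(int), cudaMemcpyDeviceToHost);

    // ... use hostDict: hostDict[i] holds the count for the float i * 0.0001f ...

    delete[] hostDict;
    cudaFree(devData);
    cudaFree(devDict);
}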

I have only tested this with smaller inputs, because I am testing on my laptop. Nevertheless, it works; more testing is still needed.

Update: Sequential version

I just wrote this naive version, which executes your algorithm for an array with 30,000,000 elements in less than 20 seconds (including the time taken by the function that generates the data).

This naive version first sorts your array of floats. Afterwards, it goes through the sorted array, counts the number of times each value appears, and then puts the value into a dictionary along with its count.

You can use a sorted map instead of the unordered_map I used.
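
A minimal sketch of that swap (std::map here is my suggestion, assuming the same key/value usage as the Mymap typedef below):

#include <map>

typedef std::map<float, int> SortedMap; // keys iterate in ascending order;
                                        // inserts are O(log n) instead of O(1) average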

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>


typedef std::tr1::unordered_map<float, int> Mymap;


void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;
    
    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{
    for(long int i = 0; i < size; i++)
        printf("%f\n", data[i]);
}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;
    
    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
          dict.insert(Mymap::value_type(previous,count));
          previous = data[i];
          count = 1;         
        }
        
    }
    dict.insert(Mymap::value_type(previous,count)); // add the last member
    return dict;
    
}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
    for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(float): " << i->first << ", value(int): " << i->second << std::endl;
    }
}


int main(int argc, char** argv)
{
  int size = 1000000;
  if(argc > 1) size = atoi(argv[1]);
  printf("Size = %d\n", size);

  // Allocate on the heap: 30 million floats (~120 MB) would overflow the stack
  float *data = new float[size];

  std::tr1::unordered_map<float, int> dict;

  generator(data, size);

  std::sort(data, data + size);
  dict = fill_dict(data, size);

  delete[] data;
  return 0;
}
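
If you want to reproduce the timing above, a simple clock()-based wrapper (my addition; the original answer does not show how the figure was measured) can be dropped into main():

#include <ctime>

// Wrap the three phases to reproduce the "< 20 seconds" figure
clock_t t0 = clock();

generator(data, size);
std::sort(data, data + size);
dict = fill_dict(data, size);

printf("Elapsed: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);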

If you have the Thrust library installed on your machine, you should use this:

#include <thrust/sort.h>
thrust::sort(data, data + size);

instead of this:

std::sort(data, data + size);

It will certainly be faster.
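
Going one step further, the sort itself can run on the GPU with thrust::device_vector. This is a sketch of my own, not something the original answer benchmarked:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

void gpu_sort(float *data, long int size)
{
    thrust::device_vector<float> d(data, data + size); // copy the array to the GPU
    thrust::sort(d.begin(), d.end());                  // sort on the device
    thrust::copy(d.begin(), d.end(), data);            // copy the result back
}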

Original answer

I'm working on a statistical application which has a large array containing 10 - 30 millions of floating point values.

Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?

Yes, it is. A month ago, I ran an entirely GPU-based Molecular Dynamics simulation. One of the kernels, which calculated the force between pairs of particles, received as parameters 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).

So if you are planning to put 30 million floating-point values there, which is about 114 MB of global memory (30,000,000 × 4 bytes), it will not be a problem.

In your case, can the number of calculations be an issue? Based on my experience with Molecular Dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours; based on your code example, it looks lighter than the MD.

Here is the force calculation example:

__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z, ...)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;

    ...

    while (pos < particles)
    {
        for (i = 0; i < particles; i++)
        {
            if (/* inside of the same radius */)
            {
                // calculate force
            }
        }
        pos += blockDim.x * gridDim.x;
    }
}

A simple example of CUDA code is the sum of two arrays:

In C:

for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i]; 

In CUDA:

__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for (; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}

In CUDA you basically take each for-loop iteration and assign it to a thread:

1) threadIdx.x + blockIdx.x*blockDim.x;

Each block has an ID from 0 to N-1 (N being the maximum number of blocks), and each block has X threads with IDs from 0 to X-1.

  1. Gives you the for-loop iteration that each thread will compute, based on its own ID and the ID of the block it is in; blockDim.x is the number of threads per block.

So if you have 2 blocks, each with 10 threads, and N=40, then:

Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
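
As a concrete illustration, the launch for this configuration could look as follows. This fragment is my sketch; it assumes host arrays a, b, and c of length 40 that are filled elsewhere:

int N = 40;
int *dev_a, *dev_b, *dev_c;

cudaMalloc(&dev_a, N * sizeof(int));
cudaMalloc(&dev_b, N * sizeof(int));
cudaMalloc(&dev_c, N * sizeof(int));
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

add<<<2, 10>>>(dev_c, dev_a, dev_b, N); // 2 blocks, 10 threads each

cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);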

Looking at your current code, I have made this draft of what it could look like in CUDA:

__global__ void hash(float *largeFloatingPointArray,
                     int largeFloatingPointArraySize, int *dictionary)
{
    // You can turn the dictionary into an array of int:
    // each position represents one float.
    // Since x = 0f; x < 100f; x += 0.0001f,
    // you can associate each x with a different position
    // in the dictionary:

    // pos 0 has the same meaning as 0f;
    // pos 1 means the float 0.0001f;
    // pos 2 means the float 0.0002f, etc.
    // Then you use the int at each position
    // to count how many times that "float" appeared.

    int x = blockIdx.x; // Each block takes a different x to work on
    float y;

    while (x < 1000000) // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = x * 0.0001f; // Convert the position back to the float it represents

        // Each thread of the block takes its share of the y values
        // from largeFloatingPointArray
        for (int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // Every thread adds its own partial count;
        // atomicAdd keeps the concurrent updates safe
        atomicAdd(&dictionary[x], noOfOccurrences);

        x += gridDim.x; // Move on to the next position this block handles
    }
}

You have to use atomicAdd because different threads may write to the same dictionary[x] concurrently, so you have to ensure mutual exclusion.
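
If the atomics ever become a bottleneck, one common refinement (my sketch, not part of the original answer) is to combine the per-thread counts of each block in shared memory first, so that only one atomicAdd per block reaches global memory. It assumes the kernel is launched with THREADS threads per block (a power of two) and that dictionary is zero-initialized, e.g. with cudaMemset:

#define THREADS 256

__global__ void hash_reduced(float *largeFloatingPointArray,
                             int largeFloatingPointArraySize,
                             int *dictionary)
{
    __shared__ int partial[THREADS];

    int x = blockIdx.x;                     // one bucket per block, as above

    while (x < 1000000)
    {
        float z = x * 0.0001f;              // the float this bucket represents
        int count = 0;

        for (int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
            if (largeFloatingPointArray[j] == z)
                count++;

        partial[threadIdx.x] = count;       // each thread's private tally
        __syncthreads();

        // Tree reduction in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
        {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)               // one global update per block
            atomicAdd(&dictionary[x], partial[0]);
        __syncthreads();                    // 'partial' is reused next iteration

        x += gridDim.x;
    }
}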

This is just one approach; you could even assign the iterations of the outer loop to the threads instead of the blocks.

Tutorials

The Dr. Dobb's Journal series CUDA: Supercomputing for the Masses by Rob Farber is excellent; it covers just about everything across its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.

And others:

Take a look at the last item; you will find many links for learning CUDA.

OpenCL: OpenCL Tutorial | MacResearch

