CUDA - CUBLAS: issues solving many (3x3) dense linear systems

Question

I am trying to solve about 1,200,000 linear systems (3x3, Ax=B) with CUDA 10.1, in particular using the cuBLAS library. I took a cue from this (useful!) post and rewrote the suggested code in a Unified Memory version. The algorithm first performs an LU factorization using cublasDgetrfBatched(), followed by two consecutive invocations of cublasDtrsmBatched(), which solve the lower- and upper-triangular systems. The code is attached below. It works correctly up to about 10,000 matrices; in that case it takes ~570 ms to perform the LU factorization (on an NVIDIA GeForce 930MX) and ~311 ms to solve the systems.

My questions are:

  1. Overload issue: it crashes when allocating memory for more than 10k matrices. Why? How can I improve my code so that it can solve the whole batch of 1.2 million matrices?

  2. Time issue: my goal would be to solve all of these systems in less than 1 second. Am I currently following the correct approach? Any suggestions otherwise?

  3. Would it be possible and/or useful, and if yes how, to use 'streams' of batches of 10k matrices?

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
#include <ctime>
#include <ratio>
#include <chrono>
#include <random>
#include <time.h>
#include <math.h>

// CUDA
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>   // batched LU factorization and triangular solves

//#include "Utilities.cuh"

using namespace std;
using namespace std::chrono;

/************************************/
/* COEFFICIENT REARRANGING FUNCTION */
/************************************/
// Applies the LAPACK-style row interchanges recorded by getrfBatched to
// each right-hand side vector. The pivots of matrix nm start at
// pivotArray + nm * N (cuBLAS stores the batch's pivot sequences
// contiguously, one length-N sequence per matrix).
void rearrange(double** vec, int* pivotArray, int N, int numMatrices) {
  for (int nm = 0; nm < numMatrices; nm++) {
    for (int i = 0; i < N; i++) {
      double temp = vec[nm][i];
      vec[nm][i] = vec[nm][pivotArray[nm * N + i] - 1];
      vec[nm][pivotArray[nm * N + i] - 1] = temp;
    }
  }
}


/************************************/
/* MAIN  */
/************************************/
int main() {

  const int N = 3; 
  const int numMatrices = 10000; // I want 1200000

  // random generator to fill matrices and coefficients
  random_device device;
  mt19937 generator(device());
  uniform_real_distribution<double> distribution(1., 5.);

  //ALLOCATE MEMORY - using unified memory
  double** h_A;
  cudaMallocManaged(&h_A, sizeof(double*) * numMatrices);
  for (int nm = 0; nm < numMatrices; nm++) {
    cudaMallocManaged(&(h_A[nm]), sizeof(double) * N * N);
  }

  double** h_b;
  cudaMallocManaged(&h_b, sizeof(double*) * numMatrices);
  for (int nm = 0; nm < numMatrices; nm++) {
    cudaMallocManaged(&(h_b[nm]), sizeof(double) * N );
  }
  cout << " memory allocated" << endl;

  // FILL MATRICES
  for (int nm = 0; nm < numMatrices; nm++) {
    for (int i = 0; i < N; i++) {
      for (int j = 0; j < N; j++) {
        h_A[nm][j * N + i] = distribution(generator);
      }
    }
  }
  cout << " Matrix filled " << endl;

  // FILL COEFFICIENTS
  for (int nm = 0; nm < numMatrices; nm++) {
    for (int i = 0; i < N; i++) {
      h_b[nm][i] = distribution(generator);
    }
  }
  cout << " Coeff. vector filled " << endl;
  cout << endl;

  // --- CUDA solver initialization
  cublasHandle_t cublas_handle;
  cublasCreate_v2(&cublas_handle);
  int* PivotArray;
  cudaMallocManaged(&PivotArray, N * numMatrices * sizeof(int));
  int* infoArray;
  cudaMallocManaged(&infoArray, numMatrices * sizeof(int));

  //CUBLAS LU SOLVER
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  cublasDgetrfBatched(cublas_handle, N, h_A, N, PivotArray, infoArray, numMatrices);
  cudaDeviceSynchronize();
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  cout << "It took " << time_span.count() * 1000. << " milliseconds." << endl;


  for (int i = 0; i < numMatrices; i++)
    if (infoArray[i] != 0) {
      fprintf(stderr, "Factorization of matrix %d Failed: Matrix may be singular\n", i);
    }

 // rearrange coefficient 
 // (temporarily on CPU, this step will be on a GPU Kernel as well)
  high_resolution_clock::time_point tA = high_resolution_clock::now();
  rearrange(h_b, PivotArray, N, numMatrices);
  high_resolution_clock::time_point tB = high_resolution_clock::now();
  duration<double> time_spanA = duration_cast<duration<double>>(tB - tA);
  cout << "rearrangement took " << time_spanA.count() * 1000. << " milliseconds." << endl;

// SOLVE THE LOWER- AND UPPER-TRIANGULAR SYSTEMS
  // --- cublasDtrsmBatched solves the triangular systems with one
  // --- right-hand side each, overwriting h_b with the result
  const double alpha = 1.0;
  high_resolution_clock::time_point t3 = high_resolution_clock::now();
  cublasDtrsmBatched(cublas_handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT, N, 1, &alpha, h_A, N, h_b, N, numMatrices);
  cublasDtrsmBatched(cublas_handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, N, 1, &alpha, h_A, N, h_b, N, numMatrices);
  cudaDeviceSynchronize();
  high_resolution_clock::time_point t4 = high_resolution_clock::now();
  duration<double> time_span2 = duration_cast<duration<double>>(t4 - t3);
  cout << "second step took " << time_span2.count() * 1000. << " milliseconds." << endl;
  
  // --- Free resources (every cudaMallocManaged allocation individually)
  for (int nm = 0; nm < numMatrices; nm++) {
    cudaFree(h_A[nm]);
    cudaFree(h_b[nm]);
  }
  cudaFree(h_A);
  cudaFree(h_b);
  cudaFree(PivotArray);
  cudaFree(infoArray);
  cublasDestroy_v2(cublas_handle);
 
  cudaDeviceReset();

  return 0;
}

Answer

Overload issue: it crashes allocating memory for more than 10k matrices. Why? How can I improve my code in order to solve the whole batch of 1.2 million matrices?

In my opinion, the biggest problem in your code is that you are making horribly inefficient use of managed memory in these key allocation loops:

  //ALLOCATE MEMORY - using unified memory
  double** h_A;
  cudaMallocManaged(&h_A, sizeof(double*) * numMatrices);
  for (int nm = 0; nm < numMatrices; nm++) {
    cudaMallocManaged(&(h_A[nm]), sizeof(double) * N * N);
  }

  double** h_b;
  cudaMallocManaged(&h_b, sizeof(double*) * numMatrices);
  for (int nm = 0; nm < numMatrices; nm++) {
    cudaMallocManaged(&(h_b[nm]), sizeof(double) * N );
  }

The problem is that each call to cudaMallocManaged has a minimum granularity. That means that if you request 1 byte, it will actually consume something like 4 kbytes of memory (I believe that is the Linux allocation granularity; it looks like you are on Windows, and I believe the Windows allocation granularity may be larger). In addition, when you launch a kernel (kernels are launched inside your cuBLAS calls), this places a hugely inefficient data-transfer load on the managed memory subsystem.
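
To put rough numbers on it: 1.2 million matrices mean about 2.4 million tiny cudaMallocManaged calls for the h_A[nm] and h_b[nm] blocks. At ~4 KB of granularity per call, that is on the order of 9-10 GB of managed pages backing only about 115 MB of actual data (1,200,000 x (9 + 3) x 8 bytes), which would easily exhaust a small GPU.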

A much better approach is a single large allocation, rather than allocation-in-a-loop, subdividing that allocation using pointer arithmetic. The code could look like this:

  //ALLOCATE MEMORY - using unified memory
  double** h_A;
  cudaMallocManaged(&h_A, sizeof(double*) * numMatrices);
  cudaMallocManaged(&(h_A[0]), sizeof(double)*numMatrices*N*N);
  for (int nm = 1; nm < numMatrices; nm++) {
    h_A[nm] = h_A[nm-1]+ N * N;
  }

  double** h_b;
  cudaMallocManaged(&h_b, sizeof(double*) * numMatrices);
  cudaMallocManaged(&(h_b[0]), sizeof(double) * numMatrices * N);
  for (int nm = 1; nm < numMatrices; nm++) {
    h_b[nm] = h_b[nm-1] + N;
  }

Another benefit of this is that the allocation process runs quite a bit faster.
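
A side benefit (not in the original code, but it follows directly from the pooled layout): cleanup collapses to one cudaFree per data block plus one per pointer array, instead of 1.2 million individual frees:

  // --- Free resources for the pooled layout
  cudaFree(h_A[0]);   // one free releases the whole numMatrices*N*N block
  cudaFree(h_A);
  cudaFree(h_b[0]);   // likewise for the numMatrices*N block
  cudaFree(h_b);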

Time issue: my goal would be to solve all of these systems in less than 1 second. Am I currently following the correct approach? Any suggestions otherwise?

With that change to your code, I am able to run successfully on a 1GB GPU (GeForce GT640), with:

const int numMatrices = 1200000;

with output like this:

$ ./t81
 memory allocated
 Matrix filled
 Coeff. vector filled

It took 70.3032 milliseconds.
rearrangement took 60.02 milliseconds.
second step took 156.067 milliseconds.

Your GPU may be somewhat slower, but I think the overall timing should easily come in at less than 1 second.

Would it be possible and/or useful, and if yes how, to use 'streams' of batches of 10k matrices?

With the above change, I don't think you need to worry about this. Streams won't help here with overlap of compute operations. They could help with copy/compute overlap (although maybe not much on your GPU), but this would be hard to architect on Windows with managed memory. For Windows usage, I would probably suggest switching to ordinary CUDA separation of host and device memory, if you want to explore copy/compute overlap.
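
If you did go down that road, the skeleton could look something like the sketch below. This is only a sketch under assumptions not in the original answer: a hypothetical CHUNK of 10k systems per buffer, pinned host staging buffers, and explicit device copies; the names h_Adata, h_bdata, d_Adata, d_bdata are hypothetical, and the batched cuBLAS calls (which need per-chunk device pointer arrays) are elided.

  // Double-buffered sketch: while chunk c computes in one stream,
  // chunk c+1 copies up in the other.
  const int CHUNK = 10000;
  cudaStream_t stream[2];
  for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

  // pinned host staging buffers: cudaMemcpyAsync can only overlap
  // compute when the host side is page-locked
  double *h_Adata, *h_bdata;
  cudaMallocHost(&h_Adata, sizeof(double) * numMatrices * N * N);
  cudaMallocHost(&h_bdata, sizeof(double) * numMatrices * N);
  // ... fill h_Adata / h_bdata ...

  // one device buffer per stream
  double *d_Adata[2], *d_bdata[2];
  for (int s = 0; s < 2; s++) {
    cudaMalloc(&d_Adata[s], sizeof(double) * CHUNK * N * N);
    cudaMalloc(&d_bdata[s], sizeof(double) * CHUNK * N);
  }

  for (int c = 0; c < numMatrices / CHUNK; c++) {
    int s = c % 2;   // alternate between the two buffers/streams
    cudaMemcpyAsync(d_Adata[s], h_Adata + (size_t)c * CHUNK * N * N,
                    sizeof(double) * CHUNK * N * N,
                    cudaMemcpyHostToDevice, stream[s]);
    cudaMemcpyAsync(d_bdata[s], h_bdata + (size_t)c * CHUNK * N,
                    sizeof(double) * CHUNK * N,
                    cudaMemcpyHostToDevice, stream[s]);
    cublasSetStream(cublas_handle, stream[s]);
    // ... run cublasDgetrfBatched / cublasDtrsmBatched on this chunk,
    //     then cudaMemcpyAsync the solutions back in stream[s] ...
  }
  cudaDeviceSynchronize();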

As an aside, you may be able to get a set of cuBLAS calls that will do the work even more quickly by using direct inversion. cuBLAS has a batched direct inversion method. I normally wouldn't suggest this for the solution of linear equations, but it may be something to consider for a set of 3x3 or 4x4 inversions, where you could easily check for singularity with the determinant method. Here is an example.
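
A minimal sketch of that approach, assuming the pooled pointer arrays above plus hypothetical h_Ainv and h_x arrays allocated the same way as h_A and h_b (cublasDmatinvBatched only supports n <= 32, which is exactly the small-matrix case here):

  // invert the whole batch directly; infoArray[i] != 0 flags a
  // singular matrix i, so the singularity check comes with the call
  cublasDmatinvBatched(cublas_handle, N, h_A, N, h_Ainv, N,
                       infoArray, numMatrices);

  // apply the inverses, x = Ainv * b, as a batched (3x3)*(3x1) GEMM
  const double one = 1.0, zero = 0.0;
  cublasDgemmBatched(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     N, 1, N, &one,
                     (const double**)h_Ainv, N,
                     (const double**)h_b, N,
                     &zero, h_x, N, numMatrices);
  cudaDeviceSynchronize();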
