Can anyone provide sample code demonstrating the use of 16-bit floating point in CUDA?

Problem description

CUDA 7.5 supports 16-bit floating point variables. Can anyone provide sample code demonstrating their use?

Recommended answer

There are a few points to note up front:

  1. Refer to the half-precision intrinsics.
  2. Note that many of these intrinsics are only supported in device code. However, in recent/current CUDA versions, many/most of the conversion intrinsics are supported in both host and device code. (And @njuffa has created a set of host-usable conversion functions here.) Therefore, even though the code sample below shows conversion in device code, the same types of conversions and intrinsics (half->float, float->half) are usable and supported in host code in the same way.
  3. Note that devices of compute capability 5.2 and below do not natively support half-precision arithmetic. This means that any arithmetic operations to be performed must be done on some supported type, such as float. Devices of compute capability 5.3 (currently Tegra TX1) and presumably future devices will support "native" half-precision arithmetic operations, but these are currently exposed through intrinsics such as __hmul. An intrinsic like __hmul will be undefined on devices that do not support native operations.
  4. You should include cuda_fp16.h in any file where you intend to make use of these types and intrinsics in device code.
  5. The half2 data type (a vector type) is really the preferred form for condensed/bulk half storage (such as in a vector or matrix), so you may want to use the relevant half2 conversion functions; a brief sketch follows this list.
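As a side note on point 5, here is a minimal device-code sketch (not part of the original answer) showing the half2 conversion intrinsics from cuda_fp16.h; the kernel name and the 0.5f scale factor are illustrative only:

#include <cuda_fp16.h>

// Scales pairs of floats via the packed half2 type; dsize2 is the number of float2 pairs.
__global__ void half2_scale_kernel(const float2 *din, float2 *dout, int dsize2){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < dsize2){
    half2 kin = __float22half2_rn(din[idx]);     // pack two floats into one half2
    half2 scf = __float2half2_rn(0.5f);          // broadcast the scale factor to both halves
    half2 kout;
#if __CUDA_ARCH__ >= 530
    kout = __hmul2(kin, scf);                    // native packed half arithmetic (cc 5.3+)
#else
    float2 f = __half22float2(kin);              // otherwise fall back to float arithmetic
    float2 s = __half22float2(scf);
    kout = __floats2half2_rn(f.x*s.x, f.y*s.y);
#endif
    dout[idx] = __half22float2(kout);            // unpack back to two floats
  }
}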

With the above points in mind, here is a simple code that takes a set of float quantities, converts them to half quantities, and scales them by a scale factor:

$ cat t924.cu
#include <stdio.h>
#include <cuda_fp16.h>
#define DSIZE 4
#define SCF 0.5f
#define nTPB 256
__global__ void half_scale_kernel(float *din, float *dout, int dsize){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < dsize){
    half scf = __float2half(SCF);
    half kin = __float2half(din[idx]);
    half kout;
#if __CUDA_ARCH__ >= 530
    kout = __hmul(kin, scf);
#else
    kout = __float2half(__half2float(kin)*__half2float(scf));
#endif
    dout[idx] = __half2float(kout);
    }
}

int main(){

  float *hin, *hout, *din, *dout;
  hin  = (float *)malloc(DSIZE*sizeof(float));
  hout = (float *)malloc(DSIZE*sizeof(float));
  for (int i = 0; i < DSIZE; i++) hin[i] = i;
  cudaMalloc(&din,  DSIZE*sizeof(float));
  cudaMalloc(&dout, DSIZE*sizeof(float));
  cudaMemcpy(din, hin, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  half_scale_kernel<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(din, dout, DSIZE);
  cudaMemcpy(hout, dout, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++) printf("%f\n", hout[i]);
  return 0;
}

$ nvcc -o t924 t924.cu
$ cuda-memcheck ./t924
========= CUDA-MEMCHECK
0.000000
0.500000
1.000000
1.500000
========= ERROR SUMMARY: 0 errors
$

If you study the above code, you'll note that, except on cc5.3 and higher devices, the arithmetic is being done as a regular float operation. This is consistent with note 3 above.

The main takeaways are:

  1. On devices of cc5.2 and below, the half datatype may still be useful, but principally as a storage optimization (and, relatedly, perhaps a memory bandwidth optimization, since e.g. a given 128-bit vector load can load 8 half quantities at once). For example, if you have a large neural network, and you've determined that the weights can tolerate being stored as half-precision quantities (thereby doubling the storage density, or approximately doubling the size of the neural network that can be represented in the storage space of a GPU), then you could store the neural network weights as half-precision. Then, when you need to perform a forward pass (inference) or a backward pass (training), you could load the weights in from memory, convert them on-the-fly (using the intrinsics) to float quantities, perform the necessary operation (perhaps including adjusting the weights due to training), then (if necessary) store the weights again as half quantities. A sketch of this pattern follows the list.
  2. For cc5.3 and future devices, if the algorithm will tolerate it, it may be possible to perform a similar operation as above, but without conversion to float (and perhaps back to half), instead leaving all data in half representation and doing the necessary arithmetic directly (using e.g. the __hmul or __hadd intrinsics).
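As an illustration of the storage pattern in point 1, here is a minimal sketch (not from the original answer; the kernel and variable names are made up for this example) in which the data lives in global memory as half and is converted to float only for the arithmetic, so it works on any compute capability:

#include <cuda_fp16.h>

// Weights are stored in global memory as half (storage/bandwidth optimization);
// the arithmetic itself is done in float.
__global__ void apply_update(half *weights, const float *grads, float lr, int n){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n){
    float w = __half2float(weights[idx]);   // load half, convert on the fly
    w -= lr * grads[idx];                   // ordinary float arithmetic
    weights[idx] = __float2half(w);         // store back as half
  }
}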

Although I haven't demonstrated it here, the half datatype is "usable" in host code. By that, I mean you can allocate storage for items of that type and perform e.g. cudaMemcpy operations on it. But the host code doesn't know anything about the half data type (e.g. how to do arithmetic on it, or print it out), and, for example, the arithmetic intrinsics are not usable in host code. Therefore, you could certainly allocate storage for a large array of half (or probably half2) data type if you wanted to (perhaps to store a set of neural network weights), but you could only directly manipulate that data with any ease from device code, not host code. A short host-side sketch follows.
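Here is a hedged host-side sketch of that pattern (not part of the original answer; it assumes a recent CUDA version in which the conversion intrinsics such as __float2half are also callable from host code, per point 2 of the notes above):

#include <cuda_fp16.h>
#include <cstdlib>

int main(){
  const int N = 1024;
  // Host code can allocate half storage and move it around...
  half *h_w = (half *)malloc(N*sizeof(half));
  for (int i = 0; i < N; i++) h_w[i] = __float2half(0.01f*i);   // host-side conversion (recent CUDA)
  half *d_w;
  cudaMalloc(&d_w, N*sizeof(half));
  cudaMemcpy(d_w, h_w, N*sizeof(half), cudaMemcpyHostToDevice); // ...e.g. cudaMemcpy works fine...
  // ...but any arithmetic on the half data would be done in device code.
  cudaFree(d_w);
  free(h_w);
  return 0;
}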

A few other comments:

  1. The CUBLAS library implements a matrix-matrix multiply designed to work directly on half data. The description above should give some insight as to what is likely going on "under the hood" for different device types (i.e. compute capabilities). A call sketch follows below.
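By way of illustration, here is a minimal, hedged sketch of a cublasHgemm call (not from the original answer; error checking is omitted, and dA, dB, dC are assumed to be column-major device arrays of half that have already been allocated and filled):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = alpha*A*B + beta*C, operating directly on half data.
// dA is m x k, dB is k x n, dC is m x n, all column-major.
void hgemm_example(const half *dA, const half *dB, half *dC, int m, int n, int k){
  cublasHandle_t handle;
  cublasCreate(&handle);
  half alpha = __float2half(1.0f);   // host-side conversions (recent CUDA versions)
  half beta  = __float2half(0.0f);
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k,
              &alpha, dA, m,
              dB, k,
              &beta, dC, m);
  cublasDestroy(handle);
}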

A related question about the use of half in thrust is here.
