Can anyone provide sample code demonstrating the use of 16 bit floating point in CUDA?


Problem description

CUDA 7.5 supports 16-bit floating point variables. Can anyone provide sample code demonstrating their use?

Recommended answer

There are a few things to note up front:

  1. Refer to the half-precision intrinsics.
  2. Note that most or all of these intrinsics are only supported in device code. (However, @njuffa has created a set of host-usable conversion functions, here.)
  3. Note that devices of compute capability 5.2 and below do not natively support half-precision arithmetic. This means that any arithmetic must be performed on some supported type, such as float. Devices of compute capability 5.3 (currently the Tegra TX1) and presumably future devices will support "native" half-precision arithmetic operations, but these are currently exposed through intrinsics such as __hmul. An intrinsic like __hmul will be undefined on devices that do not support native operations.
  4. You should include cuda_fp16.h in any file where you intend to make use of these types and intrinsics in device code.

With the above points in mind, here is a simple code that takes a set of float quantities, converts them to half quantities, and scales them by a scale factor:

$ cat t924.cu
#include <stdio.h>
#include <cuda_fp16.h>
#define DSIZE 4
#define SCF 0.5f
#define nTPB 256
__global__ void half_scale_kernel(float *din, float *dout, int dsize){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < dsize){
    // convert the scale factor and the input value from float to half
    half scf = __float2half(SCF);
    half kin = __float2half(din[idx]);
    half kout;
#if __CUDA_ARCH__ >= 530
    // native half-precision multiply, available on cc 5.3+ only
    kout = __hmul(kin, scf);
#else
    // older devices: convert to float, multiply, convert back to half
    kout = __float2half(__half2float(kin)*__half2float(scf));
#endif
    // convert the result back to float for output
    dout[idx] = __half2float(kout);
    }
}

int main(){

  float *hin, *hout, *din, *dout;
  hin  = (float *)malloc(DSIZE*sizeof(float));
  hout = (float *)malloc(DSIZE*sizeof(float));
  for (int i = 0; i < DSIZE; i++) hin[i] = i;
  cudaMalloc(&din,  DSIZE*sizeof(float));
  cudaMalloc(&dout, DSIZE*sizeof(float));
  cudaMemcpy(din, hin, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  half_scale_kernel<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(din, dout, DSIZE);
  cudaMemcpy(hout, dout, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++) printf("%f\n", hout[i]);
  return 0;
}

$ nvcc -o t924 t924.cu
$ cuda-memcheck ./t924
========= CUDA-MEMCHECK
0.000000
0.500000
1.000000
1.500000
========= ERROR SUMMARY: 0 errors
$

If you study the above code, you'll note that, except on devices of cc 5.3 and higher, the arithmetic is being done as a regular float operation. This is consistent with note 3 above.

The takeaways are as follows:

  1. On devices of cc 5.2 and below, the half datatype may still be useful, but principally as a storage optimization (and, relatedly, perhaps a memory bandwidth optimization, since e.g. a given 128-bit vector load can bring in 8 half quantities at once; see the half2 sketch after this list). For example, if you have a large neural network and you have determined that the weights can tolerate being stored as half-precision quantities (thereby doubling the storage density, or approximately doubling the size of the neural network that can be represented in the storage space of a GPU), then you could store the neural network weights as half-precision. Then, when you need to perform a forward pass (inference) or a backward pass (training), you could load the weights from memory, convert them on the fly (using the intrinsics) to float quantities, perform the necessary operation (perhaps including adjusting the weights due to training), then (if necessary) store the weights again as half quantities.
  2. For cc 5.3 and future devices, if the algorithm will tolerate it, it may be possible to perform a similar operation as above, but without conversion to float (and perhaps back to half), instead leaving all data in half representation and doing the necessary arithmetic directly (using e.g. the __hmul or __hadd intrinsics).
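Here is a minimal sketch of both points (it is not part of the original answer; the kernel and pointer names are illustrative, and it assumes a toolkit where the half2 conversion intrinsics are usable outside the cc 5.3 code path). Data stays in memory as half2 pairs, so a single 32-bit load fetches two half values; on cc 5.3+ the paired __hmul2 intrinsic is used, and older devices fall back to float arithmetic:

#include <cuda_fp16.h>

__global__ void half2_scale_kernel(const half2 *din, half2 *dout, int npairs, float scf){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < npairs){
    half2 v = din[idx];                   // one 32-bit load fetches two half values
#if __CUDA_ARCH__ >= 530
    half2 s = __float2half2_rn(scf);      // broadcast the scale factor to both halves
    dout[idx] = __hmul2(v, s);            // native paired half multiply (cc 5.3+)
#else
    float2 f = __half22float2(v);         // fall back to float arithmetic
    dout[idx] = __floats2half2_rn(f.x*scf, f.y*scf);
#endif
    }
}

The launch pattern is the same as in t924.cu above, except that the grid is sized by the number of half2 pairs rather than the number of individual elements.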

Although I haven't demonstrated it here, the half datatype is "usable" in host code. By that, I mean you can allocate storage for items of that type and perform e.g. cudaMemcpy operations on it. But host code doesn't know anything about the half data type (e.g. how to do arithmetic on it, print it out, or do type conversions), and the intrinsics are not usable in host code. Therefore, you could certainly allocate storage for a large array of half data type if you wanted to (perhaps to store a set of neural network weights), but you could only directly manipulate that data with any ease from device code, not host code.
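As a minimal sketch of that host-side pattern (not from the original answer; the names are illustrative), the host only allocates and moves raw half storage, leaving all conversion and arithmetic to device code:

#include <cuda_fp16.h>
#include <stdlib.h>
#include <string.h>

int main(){

  const int N = 1024;
  half *h_weights = (half *)malloc(N*sizeof(half));   // opaque 16-bit storage on the host
  memset(h_weights, 0, N*sizeof(half));               // the host can only move the bytes, not compute with them
  half *d_weights;
  cudaMalloc(&d_weights, N*sizeof(half));
  cudaMemcpy(d_weights, h_weights, N*sizeof(half), cudaMemcpyHostToDevice);
  // ...launch kernels here that use __half2float(), __hmul(), etc. on d_weights...
  cudaFree(d_weights);
  free(h_weights);
  return 0;
}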

A few additional notes:

  1. The CUBLAS library implements a matrix-matrix multiply designed to work directly on half data. The description above should give some insight as to what is likely going on "under the hood" for different device types (i.e. compute capabilities); a hedged sketch of such a call follows below.
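For orientation only (this is not from the original answer, and it assumes a cuBLAS version that exposes cublasSgemmEx with cudaDataType arguments), a call that multiplies matrices stored as half while letting the library accumulate in float might look like this:

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A*B for n x n column-major matrices stored as half on the device;
// alpha/beta and the internal accumulation are in float.
void half_gemm(cublasHandle_t handle, int n, const half *dA, const half *dB, half *dC){

  const float alpha = 1.0f;
  const float beta  = 0.0f;
  cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha,
                dA, CUDA_R_16F, n,
                dB, CUDA_R_16F, n,
                &beta,
                dC, CUDA_R_16F, n);
}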

A related question about the use of half in thrust is here.
