What is tf.bfloat16 "truncated 16-bit floating point"?

Question

What is the difference between tf.float16 and tf.bfloat16 as listed in https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types ?

Also, what do they mean by "quantized integer"?

Solution

bfloat16 is a tensorflow-specific format that is different from IEEE's own float16, hence the new name. The b stands for (Google) Brain.

Basically, bfloat16 is a float32 truncated to its first 16 bits. So it has the same 8 bits for exponent, and only 7 bits for mantissa. It is therefore easy to convert from and to float32, and because it has basically the same range as float32, it minimizes the risks of having NaNs or exploding/vanishing gradients when switching from float32.

From the sources:

// Compact 16-bit encoding of floating point numbers. This representation uses
// 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa.  It
// is assumed that floats are in IEEE 754 format so the representation is just
// bits 16-31 of a single precision float.
//
// NOTE: The IEEE floating point standard defines a float16 format that
// is different than this format (it has fewer bits of exponent and more
// bits of mantissa).  We don't use that format here because conversion
// to/from 32-bit floats is more complex for that format, and the
// conversion for this format is very simple.
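
That truncation is easy to sketch. The snippet below is only a minimal illustration with numpy, not TensorFlow's actual implementation (which rounds to nearest rather than simply truncating): it keeps the upper 16 bits of a float32 bit pattern, and zero-fills the lost bits on the way back.

import numpy as np

def float32_to_bfloat16_bits(x):
    # View the float32 bit pattern as a uint32 and keep only the upper
    # 16 bits: sign, 8-bit exponent, and the top 7 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    # Zero-fill the discarded low 16 bits and reinterpret as float32.
    return (np.asarray(b, dtype=np.uint32) << 16).view(np.float32)

x = np.array([3.14159], dtype=np.float32)
b = float32_to_bfloat16_bits(x)
print(hex(b[0]), bfloat16_bits_to_float32(b)[0])   # 0x4049 3.140625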

As for quantized integers, they are designed to replace floating-point values in trained networks to speed up processing. Basically, they are a kind of fixed-point encoding of real numbers, with the operating range chosen to represent the distribution of values observed at any given point in the network.
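
As a rough sketch of that idea, here is a simplified linear quantization to 8-bit integers; it only illustrates the concept and is not TensorFlow's exact scheme. The hypothetical min_val and max_val stand for the range observed for a tensor at some point in the network.

import numpy as np

def quantize(x, min_val, max_val, bits=8):
    # Map floats in the observed range [min_val, max_val] linearly onto
    # the integer grid 0 .. 2**bits - 1; out-of-range values are clipped.
    levels = 2 ** bits - 1
    scale = (max_val - min_val) / levels
    q = np.clip(np.round((x - min_val) / scale), 0, levels)
    return q.astype(np.uint8), scale

def dequantize(q, min_val, scale):
    # Recover an approximation of the original floating-point values.
    return q.astype(np.float32) * scale + min_val

x = np.array([-0.8, 0.1, 2.3], dtype=np.float32)
q, scale = quantize(x, min_val=-1.0, max_val=3.0)
print(q, dequantize(q, -1.0, scale))   # e.g. [13 70 210] and roughly [-0.796 0.098 2.294]

In a real quantized network the range is typically recorded per layer or per tensor during or after training, so the stored integers carry an implicit scale and offset with them.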

More on quantization here.
