Why is it faster to perform float by float matrix multiplication compared to int by int?


Question

I have two int matrices A and B, each with more than 1000 rows and 10K columns, and I often need to convert them to float matrices to gain a speedup (4x or more).

Why is this the case? I realize that float matrix multiplication benefits from a lot of optimization and vectorization, such as AVX. But there are also integer instructions, such as those in AVX2 (if I'm not mistaken). And can't one make use of SSE and AVX for integers?

Why isn't there a heuristic underneath matrix algebra libraries such as NumPy or Eigen to capture this and perform integer matrix multiplication as fast as float?

About the accepted answer: while @sascha's answer is very informative and relevant, @chatz's answer gives the actual reason why int by int multiplication is slow, irrespective of whether BLAS integer matrix operations exist.
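To reproduce the gap described above without any library, a minimal sketch along these lines could be used. It is not how NumPy or Eigen multiply matrices: `matmul` and `time_matmul` are hypothetical helper names, the loops are naive, and the fill values are arbitrary. Whether the float version wins, and by how much, depends on the compiler flags and the CPU.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Naive row-major product C = A * B for n x n matrices, written once for any
// element type so the int and float versions run identical loops.
template <typename T>
std::vector<T> matmul(const std::vector<T>& a, const std::vector<T>& b, std::size_t n)
{
    std::vector<T> c(n * n, T(0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const T aik = a[i * n + k];
            // The innermost loop walks contiguous memory, which gives the
            // compiler a chance to vectorize it when optimizations are on.
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}

// Hypothetical helper: wall-clock seconds for one n x n product of type T.
template <typename T>
double time_matmul(std::size_t n)
{
    std::vector<T> a(n * n), b(n * n);
    for (std::size_t i = 0; i < n * n; ++i) {
        a[i] = static_cast<T>(i % 7);  // small arbitrary values
        b[i] = static_cast<T>(i % 5);
    }
    const auto start = std::chrono::steady_clock::now();
    const std::vector<T> c = matmul(a, b, n);
    const auto stop = std::chrono::steady_clock::now();
    // Read one element so the product is not discarded as unused.
    volatile T sink = c[0];
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_matmul<int>(n)` and `time_matmul<float>(n)` from `main` for the same `n`, compiled with the same flags (for example -O3 -mavx2), gives two timings to compare.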

Recommended answer

If you compile these two simple functions, which essentially just calculate a product (using the Eigen library):

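#include <Eigen/Core>

// Integer product; one element is returned so the result is actually used.
int mult_int(const Eigen::MatrixXi& A, const Eigen::MatrixXi& B)
{
    Eigen::MatrixXi C = A * B;
    return C(0, 0);
}

// Same product with float elements (the float result is truncated to int).
int mult_float(const Eigen::MatrixXf& A, const Eigen::MatrixXf& B)
{
    Eigen::MatrixXf C = A * B;
    return C(0, 0);
}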

using the flags -mavx2 -S -O3, you will see very similar assembler code for the integer and the float versions. The main difference, however, is that vpmulld has 2-3 times the latency and just 1/2 or 1/4 the throughput of vmulps (on recent Intel architectures).

Reference: Intel Intrinsics Guide. "Throughput" there means the reciprocal throughput, i.e., how many clock cycles are used per operation if no latency occurs (somewhat simplified).
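To make the reciprocal-throughput figure concrete, here is a rough back-of-the-envelope sketch. The helper names `vector_multiplies` and `min_cycles` are hypothetical, the 8-lane width assumes 256-bit registers holding 32-bit elements, and any reciprocal-throughput values passed in should be taken from the Intrinsics Guide for the CPU in question. It ignores latency, additions, and memory traffic, so it is only a lower bound.

```cpp
#include <cstdint>

// Number of 8-lane vector multiplies needed for an (m x k) by (k x n) product.
// Assumption: 256-bit registers with 32-bit elements, i.e. 8 lanes.
std::uint64_t vector_multiplies(std::uint64_t m, std::uint64_t k, std::uint64_t n)
{
    const std::uint64_t lanes = 8;
    // Round up so a partially filled vector still counts as one instruction.
    return (m * k * n + lanes - 1) / lanes;
}

// Lower bound on cycles spent in the multiply instruction alone:
// instruction count times reciprocal throughput (cycles per instruction).
double min_cycles(std::uint64_t vector_ops, double reciprocal_throughput)
{
    return static_cast<double>(vector_ops) * reciprocal_throughput;
}
```

For a hypothetical 1000 x 10000 by 10000 x 1000 product, `vector_multiplies` gives 1.25e9 vector multiplies. With illustrative reciprocal throughputs of 0.5 cycles for the float multiply and 1.0 for the integer multiply, the bounds are 6.25e8 and 1.25e9 cycles, so the integer version needs at least twice as many cycles in the multiply alone.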
