Compiler generates costly MOVZX instruction

Question

My profiler has identified the following function as the hotspot.

#include <vector>

typedef unsigned short ushort;

bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
    for( unsigned i = 0; i < dim; ++i )
    {
        if( point[i + 1] >= coord[i] ) return false;
    }

    return true;  
}

In particular, one assembly instruction, MOVZX (Move with Zero-Extend), is responsible for the bulk of the runtime. The if statement is compiled into:

mov     rcx, QWORD PTR [rdi]
lea     r8d, [rax+1]
add     rsi, 2
movzx   r9d, WORD PTR [rsi-2]
mov     rax, r8
cmp     WORD PTR [rcx+r8*2], r9w
jae     .L5

I'd like to coax the compiler out of generating this instruction, but I suppose I first need to understand why it is generated. Why the widening/zero extension, considering that I'm working with the same data type?

(Find the entire function on godbolt compiler explorer.)

Recommended answer

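The movzx instruction zero-extends a quantity into a larger register. In your case, a word (two bytes) is zero-extended into a dword (four bytes). Compilers typically prefer movzx over a plain 16-bit mov because writing only the low 16 bits of a register leaves the upper bits to be merged with its previous contents, which can create a false dependency on the old register value. The zero extension itself is usually free; the slow part is loading the memory operand WORD PTR [rsi-2] from RAM.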

To speed this up, you can try to ensure that the datum you want to fetch from RAM is already in the L1 cache when you need it. You can do this by placing prefetch intrinsics at strategic points. For example, assuming that one cache line is 64 bytes (32 ushort entries), you could add a prefetch intrinsic that fetches array entry i + 32 on each pass through the loop.
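
A minimal sketch of that idea follows. It assumes GCC or Clang (for the `__builtin_prefetch` builtin) and 64-byte cache lines. The function name `isInteriorToPrefetch` and the constant `kPrefetchDistance` are hypothetical and not part of the original code. Whether this helps at all depends on the hardware prefetcher and the access pattern, so treat it as something to measure rather than a guaranteed win.

```cpp
#include <vector>

typedef unsigned short ushort;

// Assumption: 64-byte cache lines, so 32 two-byte entries span one line.
constexpr unsigned kPrefetchDistance = 32;

// Hypothetical variant of isInteriorTo with software prefetching.
// Like the original, it expects point to hold at least dim + 1 entries.
bool isInteriorToPrefetch( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
    const ushort* p = point.data() + 1;  // the original reads point[i + 1]

    for( unsigned i = 0; i < dim; ++i )
    {
        // Only prefetch addresses that lie inside both arrays.
        if( i + kPrefetchDistance < dim )
        {
            __builtin_prefetch( p + i + kPrefetchDistance );
            __builtin_prefetch( coord + i + kPrefetchDistance );
        }

        if( p[i] >= coord[i] ) return false;
    }

    return true;
}
```

This follows the suggestion literally and issues a prefetch on every iteration. Since 32 consecutive entries share a cache line, prefetching once per 32 iterations would request the same lines with fewer instructions.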

You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.
