Dereference pointers in XMM register (gather)

Question

If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them, into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory?

Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with the four sections of an XMM register and return four results at once to another register. Or as close to "at once" as possible. (/edit)

Edit2: thanks to PaulR's answer, I guess the terminology I'm looking for is "gather", and the question therefore is "what's the best way to implement gather for systems pre-AVX2?".

I assume there isn't an instruction for this since... well, one doesn't appear to exist as far as I can tell, and anyway it doesn't seem to be what SSE is designed for at all.

("Pointer-like value" meaning something like an integer index into an array pretending to be the heap; mechanically very different but conceptually the same thing. If, say, one wanted to use 32-bit or even 16-bit values regardless of the native pointer size, to fit more values in a register.)

Two possible reasons I can think of why one might want to do this:

  • thought it might be interesting to explore using the SSE registers for general-purpose... stuff, perhaps to have four identical 'threads' processing potentially completely unrelated/non-contiguous data, slicing through the registers "vertically" rather than "horizontally" (i.e. instead of the way they were designed to be used).

  • to build something like romcc if, for some reason (probably not a good one), one didn't want to write anything to memory, and therefore would need more register storage.

This might sound like an XY problem, but it isn't, it's just curiosity/stupidity. I'll go looking for nails once I have my hammer.

Recommended answer

The question is not entirely clear, but if you want to dereference vector register elements then the only instructions which might help you here are AVX2's gathered loads, e.g. _mm256_i32gather_epi32 et al. See the AVX2 section of the Intel Intrinsics Guide.

SYNOPSIS

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flag : AVX2

DESCRIPTION

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

OPERATION

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0
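
For systems without AVX2 (the case raised in Edit2), one possible approach is to emulate the gather lane by lane. The sketch below is only an illustration, not a tuned implementation: the names `gather4_scalar`, `gather4_sse2` and `gather4` are hypothetical, it handles four 32-bit lanes rather than eight, and it assumes the indices are element indices into an `int32_t` table (that is, a scale of 4 bytes). `gather4_scalar` mirrors the pseudocode above; `gather4_sse2` uses only SSE2 intrinsics and pulls each index out of the XMM register with a shuffle plus `_mm_cvtsi128_si32`, so the index vector itself does not need to be stored to memory first. Whether this is actually faster than a store and reload depends on the CPU and the compiler, so it would need measuring.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Scalar model of the gather pseudocode, for four lanes:
   each 32-bit index is sign-extended, multiplied by `scale` (in bytes),
   and added to the base address. */
static void gather4_scalar(const void *base, const int32_t idx[4],
                           int scale, int32_t out[4])
{
    const unsigned char *p = (const unsigned char *)base;
    for (int j = 0; j < 4; j++) {
        memcpy(&out[j], p + (ptrdiff_t)idx[j] * scale, sizeof out[j]);
    }
}

#if defined(__SSE2__)
/* Hypothetical SSE2 emulation: broadcast each lane with a shuffle, move it
   to a general-purpose register, do four scalar loads, then repack. */
static __m128i gather4_sse2(const int32_t *base, __m128i vindex)
{
    int32_t i0 = _mm_cvtsi128_si32(vindex);
    int32_t i1 = _mm_cvtsi128_si32(_mm_shuffle_epi32(vindex, _MM_SHUFFLE(1, 1, 1, 1)));
    int32_t i2 = _mm_cvtsi128_si32(_mm_shuffle_epi32(vindex, _MM_SHUFFLE(2, 2, 2, 2)));
    int32_t i3 = _mm_cvtsi128_si32(_mm_shuffle_epi32(vindex, _MM_SHUFFLE(3, 3, 3, 3)));

    /* _mm_set_epi32 takes its arguments from the highest lane to the lowest. */
    return _mm_set_epi32(base[i3], base[i2], base[i1], base[i0]);
}
#endif

/* Convenience wrapper: uses the SSE2 path when available and falls back
   to the scalar model otherwise. */
static void gather4(const int32_t *table, const int32_t idx[4], int32_t out[4])
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)idx);
    _mm_storeu_si128((__m128i *)out, gather4_sse2(table, v));
#else
    gather4_scalar(table, idx, (int)sizeof table[0], out);
#endif
}
```

With a table holding 10, 20, ..., 80 and indices 3, 0, 7, 5, `gather4` should produce 40, 10, 80, 60. If SSE4.1 is available, `_mm_extract_epi32` and `_mm_insert_epi32` could replace the shuffle-and-move steps.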