System.Numerics.Vectors 'Vector&lt;T&gt;': is it basically just System.UInt128?


Question


I'm looking into Vector<T> in the System.Numerics.Vectors namespace from version 4.5.0-preview1-26216-02. MSDN documentation says:

Vector<T> is an immutable structure that represents a single vector of a specified numeric type. The count of a Vector<T> instance is fixed, but its upper limit is CPU-register dependent.
https://docs.microsoft.com/en-us/dotnet/api/system.numerics.vector-1 (emphasis added)

Even overlooking the misguided wording "count [sic] of a Vector", this sentence seems quite unclear, since it implies that different Vector&lt;T&gt; instances might have different--albeit "fixed" up to some CPU limit--"counts" (again, so-called "count" of what, exactly?). There's no mention of an actual Count property here--or in fact anywhere on the intro page.

Now normally, I think "read-only" or "immutable" are more conventionally used than "fixed" for describing an instance's properties or fields, but in this case it turns out that the Vector&lt;T&gt;.Count property, while indeed read-only, is also static, and thus in no way associated with any Vector&lt;T&gt; instance. Instead, its value varies only according to the generic type argument T (and then presumably from machine to machine, as indicated):

bool hw = Vector.IsHardwareAccelerated;    // --> true

var c = (Vector<sbyte>.Count, Vector<short>.Count, Vector<int>.Count, Vector<long>.Count);
Debug.WriteLine(c);    // -->  (16, 8, 4, 2)

Oh.

So is it basically System.Int128 in disguise? And is that it? My questions are:

  • Am I missing something? It's true that I knew little to nothing about SIMD, but I thought this library would allow the use of much wider hardware-accelerated datatypes than just 128 bits. My HPSG parsing engine routinely performs intensive bitwise computations on vectors of 5,000+ bits.
  • Again assuming I'm not missing the point, why not just call it System.Int128 / System.UInt128 instead of Vector&lt;T&gt;? Parameterizing it with the generic primitive types does offer certain benefits, but gave me the wrong idea that it was a usefully expansive array (namely, of blittable elements T), as opposed to just a double-width CPU register, which to my mind is about as "scalar" as you can get.

    Don't get me wrong, a 128-bit register is interesting, useful, and exciting stuff--if just a bit oversold here maybe? For example, Vector<byte> is going to have 16 elements no matter what, regardless of whether you need or use them all, so the spirit of Count which is expected to vary by instance at runtime seems to be misapplied here.

  • Even though a single Vector&lt;T&gt; won't directly handle the use case I described as I was hoping, would it be worth it to update my current implementation (which uses a ulong[N >> 6] array for each N-bit vector) to use a Vector&lt;ulong&gt;[N >> 7] array instead?

    ...yes, that's "array of Vector<ulong>", which again seems strange to me; shouldn't a type with "Vector" in its name be sufficiently or usefully extensible without having to explicitly create an array to wrap multiple instances?

  • Beyond the fact that each 128-bit SIMD bitwise operation is processing twice as much data, are SIMD bitwise operations additionally faster (or slower) in cycles per opcode as well?
  • Are there other hardware platforms in common use or availability today where System.Numerics.Vectors actually reports a different SIMD bit-width?
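For reference, the scalar baseline the third bullet describes might look something like the following. This is a hypothetical sketch only; the method name `And` and the layout are illustrative, not the actual HPSG engine's code:

```csharp
using System;

static class BitVectorScalar
{
    // Hypothetical sketch of the scalar baseline described above: an N-bit
    // vector stored as ulong[N >> 6], intersected one 64-bit word at a time.
    public static ulong[] And(ulong[] a, ulong[] b)
    {
        var result = new ulong[a.Length];
        for (int i = 0; i < a.Length; i++)
            result[i] = a[i] & b[i];    // one 64-bit AND per iteration
        return result;
    }
}
```

Each iteration processes 64 bits, so a 5,000-bit vector takes about 79 such operations per intersection.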

Solution

The vector size is not always 16 bytes, though that is very common. For example, on a platform with AVX2, programs running in 64-bit mode get 32-byte vectors. In this way the Count property can also vary on the same machine (for the same T), by running the program in different modes. In principle it wouldn't have to be like that; a 32-bit program could still use 256-bit operations even with just AVX1 support, but that's not how System.Numerics.Vectors works. The varying size per feature level of the CPU is a fairly fundamental part of the design of the API, presumably to enable some form of future-proofing, though it perhaps contributed to the lack of shuffles (which would be hard to specify for a vector of not-statically-known size).

I thought this library would allow the use of much wider hardware-accelerated datatypes than just 128 bits

That doesn't exist in the hardware so it would be hard to offer. AVX-512 goes up to 512 bits as the name implies, but that's as far as SIMD on mainstream CPUs goes for now.

why not just call it System.Int128 / System.UInt128

I would expect those types to map to actual integer types, not vector types. Many operations that would make sense on a 128-bit integer do not actually exist as CPU instructions, and almost all operations that do exist operate on 2 × 64 (Vector&lt;long&gt;, long[2]), 4 × 32 (Vector&lt;int&gt;, int[4]), 8 × 16 (Vector&lt;short&gt;, short[8]) or 16 × 8 (Vector&lt;byte&gt;, byte[16]) bit vectors (or double those widths on platforms that support it). Offering a "byte-wise add" operation on an Int128 would be strange, and not offering true 128-bit addition would make it even stranger. Besides, as mentioned earlier, the size is not by definition 128 bits; that's just common.
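The element-wise nature is easy to observe: a carry out of one element never propagates into the next, so a 128-bit Vector&lt;byte&gt; does not behave like a 128-bit integer. A minimal sketch:

```csharp
using System;
using System.Numerics;

static class ElementwiseDemo
{
    public static void Main()
    {
        var a = new Vector<byte>(255);   // every element = 0xFF
        var b = new Vector<byte>(1);     // every element = 0x01
        Vector<byte> sum = a + b;        // per-element add; each byte wraps to 0x00
        Console.WriteLine(sum[0]);       // prints 0 -- no carry into the next element
    }
}
```

A true 128-bit addition would instead carry out of the low byte and produce a nonzero higher element.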

Many SIMD operations are quite fast, though there are some exceptions. 32-bit multiplication typically has a rather extreme latency for example. The System.Numerics.Vectors API also allows some non-existent operations (that must be emulated slowly, such as integer division or byte multiplication) without hinting that there is a problem. Operations that map to instructions that actually exist are mostly fast.

While bitwise operations on ulong are also fast, their vector versions are even better in terms of total work done per unit time. For example, Skylake can execute (at best) four scalar bitwise operations per cycle (though extra operations such as an addition and a compare/branch to form a loop compete for the same resources), whereas it can execute three 256-bit bitwise operations per cycle with SIMD--three times as much work in the same time--while still leaving an execution port open for a scalar operation or branch.

So yes, it's probably worth using. You could keep the array of ulong and use the construct-from-array constructor of Vector&lt;T&gt;, so that you don't have to deal with vectors everywhere. For example, indexing into a vector with a variable index is not a nice operation at all, causing a branch, a vector store and a scalar reload. The variable-size nature of the vectors also significantly complicates using arrays of them directly, compared with using arrays of primitive types and then vector-loading from them. You could easily round the length of the array up to a multiple of the vector count, though, to remove the need for a small scalar loop to handle the remaining items at the end of the array that don't quite fit in a vector.
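The approach suggested above might be sketched as follows. This assumes the array length has already been rounded up to a multiple of Vector&lt;ulong&gt;.Count as described; the type and method names are illustrative:

```csharp
using System;
using System.Numerics;

static class BitVectorSimd
{
    // Sketch: keep the ulong[] storage, but do the bitwise work through the
    // construct-from-array constructor. Assumes a.Length == b.Length and that
    // both are a multiple of Vector<ulong>.Count (no scalar tail loop needed).
    public static void AndInPlace(ulong[] a, ulong[] b)
    {
        int step = Vector<ulong>.Count;           // 2 with SSE2, 4 with AVX2
        for (int i = 0; i < a.Length; i += step)
        {
            var va = new Vector<ulong>(a, i);     // vector-load from the array
            var vb = new Vector<ulong>(b, i);
            (va & vb).CopyTo(a, i);               // store the AND back into a
        }
    }
}
```

Because the loads and stores go through the plain ulong[], the rest of the program never needs to index into a Vector&lt;ulong&gt; directly.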
