如何使用SIMD实现atoi？ [英] How to implement atoi using SIMD?

查看：250 发布时间：2016/10/19 19:59:47 c++ x86 sse simd atoi

本文介绍了如何使用SIMD实现atoi？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想尝试使用SIMD说明书写一个atoi实现，以包含在 RapidJSON （一个C ++ JSON读/写库）。目前在其他地方有一些SSE2和SSE4.2优化。

如果是速度增益，多个 atoi 结果可以并行完成。字符串最初来自JSON数据的缓冲区，因此多atoi函数将必须执行任何所需的swizzling。

我想出的算法是以下：

我可以用以下方式初始化长度为N的向量：
[10 ^ N..10 ^ 1]

我将缓冲区中的每个字符转换为整数，并将它们放在另一个向量中。

我将有效数字向量中的每个数字乘它由数字中的匹配数向量和求和结果。

我的目标是x86和x86-64架构。

知道AVX2支持三个操作数Fused Multiply-Add，所以我可以执行Sum = Number * Significant + Sum。

这是我到目前为止的地方。

是我的算法正确吗？有更好的方法吗？

有一个参考实现atoi使用任何SIMD指令集？

解决方案

算法及其实现现已完成。

此属性已完成且已（适度）测试（为较少的常量内存使用和容忍加号更新）代码如下：

适用于 int 和 uint ，
从 MIN_INT = -2147483648 到 MAX_INT = 2147483647 和
从 MIN_UINT = 0 到 MAX_UINT = 4294967295

一个领先的' - ' char表示负数（明智），领先的'+'忽略

忽略前导零（有或无符号字符）

忽略溢出 - 更大的数字只是回绕

零长度字符串导致值 0 = -0

无效字符被识别，转换结束于第一个无效字符

最后一个前导零之后的至少16个字节必须可访问，并且在EOS之后读取的可能安全隐患留给调用者

只有SSE4。 2 关于此实施：
ul>
此代码示例可以使用GNU Assembler（ as ）使用 .intel_syntax noprefix 开头

常量的数据占用量为64字节（4 * 128位XMM），等于一个高速缓存行。

代码占用空间为46个指令具有51微秒和64个周期的延迟

一个循环用于删除前导零，否则没有跳转，除了错误处理，... ...

时间复杂度为O（1）

算法的方法：

   -  ESI中的数字字符串指针
  - 检查第一个字符是否为' - '，然后指示EDX中的负数* A **）
  - 检查前导零和EOS（** B **）
  - 检查字符串的有效数字，并获取有效字符的strlen（）（** C **）
  - 反向字符串，使得
的功率10 ^ 0始终为BYTE 15 
 10 ^ 1始终为BYTE 14 
 10 ^ 2始终为BYTE 13 
 10 ^ 3总是在BYTE 12 
 10 ^ 4总是在BYTE 11 
 ... 
并且掩蔽掉所有后面的字符（** D **）
  - 减去饱和16个可能的字符（** 1 **）中的每一个都为'0'
  - 取16个连续的字节值，并将它们分成两个XMM寄存器（** 2 **）中的WORD 
 
 PONMLKJI | H G F E D C B A  - > 
 H G F E | D B B A（XMM0）
 P O N M | LKJI（XMM1）
  - 乘以每个WORD的位置值模数10000（1,10,100,1000）
（因子小于MAX_WORD，每个QWORD / halfXMM有4个因子）
 2 **），所以我们可以在另一个乘法之前水平合并两次。 
 PMADDWD指令可以执行此操作和下一步：
  - 将相邻的WORD水平添加到DWORD（** 3 **）
 H * 1000 + G * 100 F * 10 + E * 1 | D * 1000 + C * 100 B * 10 + A * 1（XMM0）
 P * 1000 + O * 100 N * 10 + M * L * 1000 + K * 100 J * 10 + I * 1（XMM1）
  - 将从XMM0和XMM1到XMM0（** 4 **）的相邻DWORD水平添加
 xmmDst [31-0] xmm0 [63-32] + xmm0 [31-0] 
 xmmDst [63-32] = xmm0 [127-96] + xmm0 [95-64] 
 xmmDst [95-64] = xmm1 [ xmm1 [127-96] + xmm1 [93-32] + xmm1 [31-0] 
 xmmDst [127-96] + xmm1 [95-64] 
  -  XMM0中的值乘以因子* 5 **）
 P * 1000 + O * 100 + N * 10 + M * 1（DWORD因子1000000000000 =对于DWORD太大，但可能对QWORD数字字符串有用）
 L * 1000 + K * 100 + J * 10 + I * 1（DWORD因子100000000）
 H * 1000 + G * 100 + F * 10 + E * 1（DWORD因子10000）
 D * 1000 + C * 100 + B * 10 + A * 1（DWORD因子1）
  - 最后一步是将这四个DWORD与2 * PHADDD一起添加2 *（PSHUFD + PADDD）
  -  xmm0 [31- 0] = xmm0 [63-32] + xmm0 [31-0]（** 6 **）
 xmm0 [63-32] = xmm0 [127-96] + xmm0 [95-64] 
（上QWORD包含相同并被忽略）
  -  xmm0 [31-0] = xmm0 [63-32] + xmm0 [31-0]（** 7 **）
  - 如果数字为负（在EDX中由000表示... 0 = pos或111 ... 1 = neg），取反（** 8 **）

使用intel语法的GNU Assembler中的示例实现：

  .intel_syntax noprefix 
 .data 
 .align 64 
 ddqDigitRange：.byte'0'，'9'，0,0,0,0,0,0， 0,0,0,0,0,0,0,0 
 ddqShuffleMask：.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2 ，1,0 
 ddqFactor1：.word 1,10,100,1000，1,10,100,1000 
 ddqFactor2：.long 1,10000,100000000,0 
 .text 
 _start： 
 mov esi，lpInputNumberString 
 / *（** A **）表示EDX中的负数* / 
 mov eax，-1 
 xor ecx，ecx 
 xor edx，edx 
 mov bl，byte ptr [esi] 
 cmp bl，' - '
 cmove edx，eax 
 cmp bl，'+'
 cmove ecx， eax 
 sub esi，edx 
 sub esi，ecx 
 / *（** B **）删除前导零* / 
 xor eax，eax / *返回值ZERO * 
 remove_leading_zeros：
 inc esi 
 cmp byte ptr [esi-1]，'0'/ *跳过前导零* / 
 je remove_leading_zeros 
 cmp byte ptr [esi -1]，0 / * catch空字符串/数字* / 
 je FINISH 
 dec esi 
 / *检查有效的数字字符并从前向后翻转* / 
 pxor xmm2，xmm2 
 movdqa xmm0，xmmword ptr [ddqDigitRange] 
 movdqu xmm1，xmmword ptr [esi] 
 pcmpistri xmm0，xmm1,0b00010100 / *（** C **）iim8 =字节，范围，负极性（ - ），返回ECX中的strlen（）* / 
 jo FINISH / *如果第一个字符无效返回0  - 防止处理空字符串 -  0仍然在EAX * / 
 mov al，'0'/ *从chars中减去的值* / 
子ecx，16 / * 16-len =阴影到零为shuffle mask * / 
 movd xmm0，ecx 
 pshufb xmm0，xmm2 / *向所有16个字节广播CL * / 
 paddb xmm0，xmmword ptr [ddqShuffleMask] / *生成PSHUFB的置换掩码 - 0具有最高位集合意味着位置被置零* / 
 pshufb xmm1，xmm0 / *（** D **）permute  - 现在从最高到最低BYTE是因子10 ^ 0,10 ^ 1,10 ^ 2， ... * / 
 movd xmm0，eax / * AL =从上面的'0'* / 
 pshufb xmm0，xmm2 / *广播AL到XMM0 * / 
 psubusb xmm1，xmm0 / （** 1 **）* / 
 movdqa xmm0，xmm1 
 punpcklbw xmm0，xmm2 / *（** 2 **）* / 
 punpckhbw xmm1，xmm2 
 pmaddwd xmm0 ，xmmword ptr [ddqFactor1] / *（** 3 **）* / 
 pmaddwd xmm1，xmmword ptr [ddqFactor1] 
 phaddd xmm0，xmm1 / *（** 4 **）* / 
 pmulld xmm0，xmmword ptr [ddqFactor2] / *（** 5 **）* / 
 pshufd xmm1，xmm0，0b11101110 / *（** 6 **）* / 
 paddd xmm0，xmm1 
 pshufd xmm1，xmm0，0b01010101 / *（** 7 **）* / 
 paddd xmm0，xmm1 
 movd eax，xmm0 
 / *如果负数则取反* / 
 add eax，edx / *（** 8 **）* / 
 xor eax，edx 
 FINISH：
 / * EAX是return（u）int value * /

Haswell 32位的Intel-IACA吞吐量分析结果： p>

 吞吐量分析报告
 --------------------- ----- 
块吞吐量：16.10循环吞吐量瓶颈：Interlteration 
 
循环中的端口绑定每次迭代：
 ------------- -------------------------------------------------- ------------------------ 
 | Port | 0-DV | 1 | 2-D | 3-D | 4 | 5 | 6 | 7 | 
 ---------------------------------------------- ----------------------------------------- 
 |周期| 9.5 0.0 | 10.0 | 4.5 4.5 | 4.5 4.5 | 0.0 | 11.1 | 11.4 | 0.0 | 
 ---------------------------------------------- ----------------------------------------- 
 
 N - 端口号或周期数资源冲突引起的延迟，DV  - 分频器管道（在端口0上）
 D  - 关键路径上的数据提取管道（在端口2和3上），CP  -  
 F  - 与前一个指令发生的宏融合
 *  - 指令微操作未绑定到端口
 ^  -  Micro Fusion发生
＃ -  ESP跟踪同步uop已发出
 @  -  SSE指令后跟一个AVX256指令，几十个周期的惩罚预期
！ - 不支持指令，未分析
 
 | Num Of |循环中的端口压力| | 
 | Uops | 0-DV | 1 | 2-D | 3-D | 4 | 5 | 6 | 7 | | 
 ---------------------------------------------- ----------------------------------- 
 | 0 * | | | | | | | | | | xor eax，eax 
 | 0 * | | | | | | | | | | xor ecx，ecx 
 | 0 * | | | | | | | | | | xor edx，edx 
 | 1 | | 0.1 | | | | | 0.9 | | | dec eax 
 | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | mov bl，byte ptr [esi] 
 | 1 | | | | | | | 1.0 | | CP | cmp bl，0x2d 
 | 2 | 0.1 | 0.2 | | | | | 1.8 | | CP | cmovz edx，eax 
 | 1 | 0.1 | 0.5 | | | | | 0.4 | | CP | cmp bl，0x2b 
 | 2 | 0.5 | 0.2 | | | | | 1.2 | | CP | cmovz ecx，eax 
 | 1 | 0.2 | 0.5 | | | | | 0.2 | | CP | sub esi，edx 
 | 1 | 0.2 | 0.5 | | | | | 0.3 | | CP | sub esi，ecx 
 | 0 * | | | | | | | | | | xor eax，eax 
 | 1 | 0.3 | 0.1 | | | | | 0.6 | | CP | incesi 
 | 2 ^ | 0.3 | | 0.5 0.5 | 0.5 0.5 | | | 0.6 | | | cmp byte ptr [esi-0x1]，0x30 
 | 0F | | | | | | | | | | jz 0xfffffffb 
 | 2 ^ | 0.6 | | 0.5 0.5 | 0.5 0.5 | | 0.4 | | | cmp byte ptr [esi-0x1]，0x0 
 | 0F | | | | | | | | | | jz 0x8b 
 | 1 | 0.1 | 0.9 | | | | | | | CP | dec esi 
 | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqa xmm0，xmmword ptr [0x80492f0] 
 | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | movdqu xmm1，xmmword ptr [esi] 
 | 0 * | | | | | | | | | | pxor xmm2，xmm2 
 | 3 | 2.0 | 1.0 | | | | | | | CP | pcmpistri xmm0，xmm1，0x14 
 | 1 | | | | | | | 1.0 | | | jo 0x6e 
 | 1 | | 0.4 | | | | 0.1 | 0.5 | | | mov al，0x30 
 | 1 | 0.1 | 0.5 | | | | 0.1 | 0.3 | | CP | sub ecx，0x10 
 | 1 | | | | | | 1.0 | | | CP | movd xmm0，ecx 
 | 1 | | | | | | 1.0 | | | CP | pshufb xmm0，xmm2 
 | 2 ^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | CP | paddb xmm0，xmmword ptr [0x80492c0] 
 | 1 | | | | | | 1.0 | | | CP | pshufb xmm1，xmm0 
 | 1 | | | | | | 1.0 | | | | movd xmm0，eax 
 | 1 | | | | | | 1.0 | | | | pshufb xmm0，xmm2 
 | 1 | | 1.0 | | | | | | | CP | psubusb xmm1，xmm0 
 | 0 * | | | | | | | | | CP | movdqa xmm0，xmm1 
 | 1 | | | | | | 1.0 | | | CP | punpcklbw xmm0，xmm2 
 | 1 | | | | | | 1.0 | | | | punpckhbw xmm1，xmm2 
 | 2 ^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmaddwd xmm0，xmmword ptr [0x80492d0] 
 | 2 ^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | | pmaddwd xmm1，xmmword ptr [0x80492d0] 
 | 3 | | 1.0 | | | | 2.0 | | | CP | phaddd xmm0，xmm1 
 | 3 ^ | 2.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmulld xmm0，xmmword ptr [0x80492e0] 
 | 1 | | | | | | 1.0 | | | CP | pshufd xmm1，xmm0，0xee 
 | 1 | | 1.0 | | | | | | | CP | paddd xmm0，xmm1 
 | 1 | | | | | | 1.0 | | | CP | pshufd xmm1，xmm0，0x55 
 | 1 | | 1.0 | | | | | | | CP | paddd xmm0，xmm1 
 | 1 | 1.0 | | | | | | | | CP | movd eax，xmm0 
 | 1 | | | | | | | 1.0 | | CP | add eax，edx 
 | 1 | | | | | | | 1.0 | | CP | xor eax，edx 
 Uops总数：51

-asac 32位的延迟分析：

 延迟分析报告
 ----- ---------------------- 
延迟：64个周期
 
 N  - 端口号或周期数资源冲突引起的延迟，DV  - 分频器管道（在端口0上）
 D  - 关键路径上的数据提取管道（在端口2和3上），CP  -  
 F  - 与上一条指令发生的宏融合
 *  - 指令微操作未绑定到端口
 ^  -  Micro Fusion发生
＃ - 发出ESP跟踪同步uop 
 @  -  Intel（R）AVX到Intel（R）SSE代码开关，几十个周期的惩罚预期
！ - 不支持指令，未计入分析
 
资源延迟计数，因为所有指令源都准备就绪
，直到需要的资源可用
 
 | Inst |周期资源延迟| | 
 | Num | 0-DV | 1 | 2-D | 3-D | 4 | 5 | 6 | 7 | FE | | 
 ---------------------------------------------- --------------------------- 
 | 0 | | | | | | | | | | | xor eax，eax 
 | 1 | | | | | | | | | | | xor ecx，ecx 
 | 2 | | | | | | | | | | | xor edx，edx 
 | 3 | | | | | | | | | | | dec eax 
 | 4 | | | | | | | | | 1 | CP | mov bl，byte ptr [esi] 
 | 5 | | | | | | | | | CP | cmp bl，0x2d 
 | 6 | | | | | | | | | | CP | cmovz edx，eax 
 | 5 | | | | | | | | | | CP | cmp bl，0x2b 
 | 8 | | | | | | | 1 | | | CP | cmovz ecx，eax 
 | 9 | | | | | | | | | | CP | sub esi，edx 
 | 10 | | | | | | | | | | CP | sub esi，ecx 
 | 11 | | | | | | | | | 3 | | xor eax，eax 
 | 12 | | | | | | | | | | CP | incesi 
 | 13 | | | | | | | | | | | cmp byte ptr [esi-0x1]，0x30 
 | 14 | | | | | | | | | | | jz 0xfffffffb 
 | 15 | | | | | | | | | | | cmp byte ptr [esi-0x1]，0x0 
 | 16 | | | | | | | | | | | jz 0x8b 
 | 17 | | | | | | | | | | CP | dec esi 
 | 18 | | | | | | | | | 4 | | movdqa xmm0，xmmword ptr [0x80492f0] 
 | 19 | | | | | | | | | | CP | movdqu xmm1，xmmword ptr [esi] 
 | 20 | | | | | | | | | 5 | | pxor xmm2，xmm2 
 | 21 | | | | | | | | | | CP | pcmpistri xmm0，xmm1，0x14 
 | 22 | | | | | | | | | | | jo 0x6e 
 | 23 | | | | | | | | | 6 | | mov al，0x30 
 | 24 | | | | | | | | | | CP | sub ecx，0x10 
 | 25 | | | | | | | | | | CP | movd xmm0，ecx 
 | 26 | | | | | | | | | | CP | pshufb xmm0，xmm2 
 | 27 | | | | | | | | | 7 | CP | paddb xmm0，xmmword ptr [0X80492c0] 
 | 28 | | | | | | | | | | CP | pshufb xmm1，xmm0 
 | 29 | | | | | | 1 | | | | | movd xmm0，eax 
 | 30 | | | | | | 1 | | | | | pshufb xmm0，xmm2 
 | 31 | | | | | | | | | | CP | psubusb xmm1，xmm0 
 | 32 | | | | | | | | | | CP | movdqa xmm0，xmm1 
 | 33 | | | | | | | | | | CP | punpcklbw xmm0，xmm2 
 | 34 | | | | | | | | | | | punpckhbw xmm1，xmm2 
 | 35 | | | | | | | | | 9 | CP | pmaddwd xmm0，xmmword ptr [0x80492d0] 
 | 36 | | | | | | | | | 9 | | pmaddwd xmm1，xmmword ptr [0x80492d0] 
 | 37 | | | | | | | | | | CP | phaddd xmm0，xmm1 
 | 38 | | | | | | | | | 10 | CP | pmulld xmm0，xmmword ptr [0x80492e0] 
 | 39 | | | | | | | | | | CP | pshufd xmm1，xmm0，0xee 
 | 40 | | | | | | | | | | CP | paddd xmm0，xmm1 
 | 41 | | | | | | | | | | CP | pshufd xmm1，xmm0，0x55 
 | 42 | | | | | | | | | | CP | paddd xmm0，xmm1 
 | 43 | | | | | | | | | | CP | movd eax，xmm0 
 | 44 | | | | | | | | | | CP | add eax，edx 
 | 45 | | | | | | | | | | CP | xor eax，edx 
 
关键路径上的资源冲突：
 ---------------------------- ------------------------------------- 
 |端口| 0-DV | 1 | 2-D | 3-D | 4 | 5 | 6 | 7 | 
 ---------------------------------------------- ------------------- 
 |周期| 0 0 | 0 | 0 0 | 0 0 | 0 | 0 | 1 | 0 | 
 ---------------------------------------------- ------------------- 
 
关键路径上的延迟列表
 ------------- ------------------ 
 6  - > 8 1周期在Port6上延迟

Cordes用 imul 替换了最后两个 add + xor 指令。 这个OpCodes的集中度可能更高。不幸的是IACA不支持这个指令，并且抛出一个！ - 不支持指令，未在Analysis 注释中。尽管如此，尽管我喜欢OpCodes的减少和从（2uops，2c延迟）到（1 uop，3c延迟 - 更糟的延迟，但AMD仍然一个m-op），减少，我宁愿留给实施者，方式选择。我没有检查下面的代码是否足以解析任何数字。

另一种方法是将最后两行替换为：
p>

... / *如果负数则取反* / imul eax，edx ： / * EAX is return（u）int value * /

I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.

If it's a speed gain, multiple atoi results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.

The algorithm I came up with is the following:

I can initialize a vector of length N in the following fashion: [10^N..10^1]

I convert each character in the buffer to an integer and place them in another vector.

I take each number in the significant digits vector and multiply it by the matching number in the numbers vector and sum the results.

I'm targeting x86 and x86-64 architectures.

I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum.
That's where I got so far.
Is my algorithm correct? Is there a better way?
Is there a reference implementation for atoi using any SIMD instructions set?
解决方案
The algorithm and its implementation is finished now. It's complete and (moderately) tested (Updated for less constant memory usage and tolerating plus-char).

The properties of this code are as follows:

Works for int and uint, from MIN_INT=-2147483648 to MAX_INT=2147483647 and from MIN_UINT=0 to MAX_UINT=4294967295

A leading '-' char indicates a negative number (as sensible), a leading '+' char is ignored

Leading zeros (with or without sign char) are ignored

Overflow is ignored - bigger numbers just wraparound

Zero length strings result in value 0 = -0

Invalid characters are recognized and the conversion ends at the first invalid char

At least 16 bytes after the last leading zero must be accessible and possible security implications of reading after EOS are left to the caller

Only SSE4.2 is needed

About this implementation:

This code sample can be run with GNU Assembler(as) using .intel_syntax noprefix at the beginning

Data footprint for constants is 64 bytes (4*128 bit XMM) equalling one cache line.

Code footprint is 46 instructions with 51 micro-Ops and 64 cycles latency

One loop for removal of leading zeros, otherwise no jumps except for error handling, so...

Time complexity is O(1)

The approach of the algorithm:
- Pointer to number string is expected in ESI - Check if first char is '-', then indicate if negative number in EDX (**A**) - Check for leading zeros and EOS (**B**) - Check string for valid digits and get strlen() of valid chars (**C**) - Reverse string so that power of 10^0 is always at BYTE 15 10^1 is always at BYTE 14 10^2 is always at BYTE 13 10^3 is always at BYTE 12 10^4 is always at BYTE 11 ... and mask out all following chars (**D**) - Subtract saturated '0' from each of the 16 possible chars (**1**) - Take 16 consecutive byte-values and and split them to WORDs in two XMM-registers (**2**) P O N M L K J I | H G F E D C B A -> H G F E | D C B A (XMM0) P O N M | L K J I (XMM1) - Multiply each WORD by its place-value modulo 10000 (1,10,100,1000) (factors smaller then MAX_WORD, 4 factors per QWORD/halfXMM) (**2**) so we can horizontally combine twice before another multiply. The PMADDWD instruction can do this and the next step: - Horizontally add adjacent WORDs to DWORDs (**3**) H*1000+G*100 F*10+E*1 | D*1000+C*100 B*10+A*1 (XMM0) P*1000+O*100 N*10+M*1 | L*1000+K*100 J*10+I*1 (XMM1) - Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**) xmmDst[31-0] = xmm0[63-32] + xmm0[31-0] xmmDst[63-32] = xmm0[127-96] + xmm0[95-64] xmmDst[95-64] = xmm1[63-32] + xmm1[31-0] xmmDst[127-96] = xmm1[127-96] + xmm1[95-64] - Values in XMM0 are multiplied with the factors (**5**) P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings) L*1000+K*100+J*10+I*1 (DWORD factor 100000000) H*1000+G*100+F*10+E*1 (DWORD factor 10000) D*1000+C*100+B*10+A*1 (DWORD factor 1) - The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD) - xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**6**) xmm0[63-32] = xmm0[127-96] + xmm0[95-64] (the upper QWORD contains the same and is ignored) - xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**7**) - If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(**8**)
And the sample implementation in GNU Assembler with intel syntax:
.intel_syntax noprefix .data .align 64 ddqDigitRange: .byte '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0 ddqShuffleMask:.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 ddqFactor1: .word 1,10,100,1000, 1,10,100,1000 ddqFactor2: .long 1,10000,100000000,0 .text _start: mov esi, lpInputNumberString /* (**A**) indicate negative number in EDX */ mov eax, -1 xor ecx, ecx xor edx, edx mov bl, byte ptr [esi] cmp bl, '-' cmove edx, eax cmp bl, '+' cmove ecx, eax sub esi, edx sub esi, ecx /* (**B**)remove leading zeros */ xor eax,eax /* return value ZERO */ remove_leading_zeros: inc esi cmp byte ptr [esi-1], '0' /* skip leading zeros */ je remove_leading_zeros cmp byte ptr [esi-1], 0 /* catch empty string/number */ je FINISH dec esi /* check for valid digit-chars and invert from front to back */ pxor xmm2, xmm2 movdqa xmm0, xmmword ptr [ddqDigitRange] movdqu xmm1, xmmword ptr [esi] pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) iim8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */ jo FINISH /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */ mov al , '0' /* value to subtract from chars */ sub ecx, 16 /* 16-len=negative to zero for shuffle mask */ movd xmm0, ecx pshufb xmm0, xmm2 /* broadcast CL to all 16 BYTEs */ paddb xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */ pshufb xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */ movd xmm0, eax /* AL='0' from above */ pshufb xmm0, xmm2 /* broadcast AL to XMM0 */ psubusb xmm1, xmm0 /* (**1**) */ movdqa xmm0, xmm1 punpcklbw xmm0, xmm2 /* (**2**) */ punpckhbw xmm1, xmm2 pmaddwd xmm0, xmmword ptr [ddqFactor1] /* (**3**) */ pmaddwd xmm1, xmmword ptr [ddqFactor1] phaddd xmm0, xmm1 /* (**4**) */ pmulld xmm0, xmmword ptr [ddqFactor2] /* (**5**) */ pshufd xmm1, xmm0, 0b11101110 /* (**6**) */ paddd xmm0, xmm1 pshufd xmm1, xmm0, 0b01010101 /* (**7**) */ paddd xmm0, xmm1 movd eax, xmm0 /* negate if negative number */ add eax, edx /* (**8**) */ xor eax, edx FINISH: /* EAX is return (u)int value */
The result of Intel-IACA Throughput Analysis for Haswell 32-bit:
Throughput Analysis Report -------------------------- Block Throughput: 16.10 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: --------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | --------------------------------------------------------------------------------------- | Cycles | 9.5 0.0 | 10.0 | 4.5 4.5 | 4.5 4.5 | 0.0 | 11.1 | 11.4 | 0.0 | --------------------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 0* | | | | | | | | | | xor eax, eax | 0* | | | | | | | | | | xor ecx, ecx | 0* | | | | | | | | | | xor edx, edx | 1 | | 0.1 | | | | | 0.9 | | | dec eax | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | mov bl, byte ptr [esi] | 1 | | | | | | | 1.0 | | CP | cmp bl, 0x2d | 2 | 0.1 | 0.2 | | | | | 1.8 | | CP | cmovz edx, eax | 1 | 0.1 | 0.5 | | | | | 0.4 | | CP | cmp bl, 0x2b | 2 | 0.5 | 0.2 | | | | | 1.2 | | CP | cmovz ecx, eax | 1 | 0.2 | 0.5 | | | | | 0.2 | | CP | sub esi, edx | 1 | 0.2 | 0.5 | | | | | 0.3 | | CP | sub esi, ecx | 0* | | | | | | | | | | xor eax, eax | 1 | 0.3 | 0.1 | | | | | 0.6 | | CP | inc esi | 2^ | 0.3 | | 0.5 0.5 | 0.5 0.5 | | | 0.6 | | | cmp byte ptr [esi-0x1], 0x30 | 0F | | | | | | | | | | jz 0xfffffffb | 2^ | 0.6 | | 0.5 0.5 | 0.5 0.5 | | | 0.4 | | | cmp byte ptr [esi-0x1], 0x0 | 0F | | | | | | | | | | jz 0x8b | 1 | 0.1 | 0.9 | | | | | | | CP | dec esi | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqa xmm0, xmmword ptr [0x80492f0] | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | movdqu xmm1, xmmword ptr [esi] | 0* | | | | | | | | | | pxor xmm2, xmm2 | 3 | 2.0 | 1.0 | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14 | 1 | | | | | | | 1.0 | | | jo 0x6e | 1 | | 0.4 | | | | 0.1 | 0.5 | | | mov al, 0x30 | 1 | 0.1 | 0.5 | | | | 0.1 | 0.3 | | CP | sub ecx, 0x10 | 1 | | | | | | 1.0 | | | CP | movd xmm0, ecx | 1 | | | | | | 1.0 | | | CP | pshufb xmm0, xmm2 | 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | CP | paddb xmm0, xmmword ptr [0x80492c0] | 1 | | | | | | 1.0 | | | CP | pshufb xmm1, xmm0 | 1 | | | | | | 1.0 | | | | movd xmm0, eax | 1 | | | | | | 1.0 | | | | pshufb xmm0, xmm2 | 1 | | 1.0 | | | | | | | CP | psubusb xmm1, xmm0 | 0* | | | | | | | | | CP | movdqa xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | punpcklbw xmm0, xmm2 | 1 | | | | | | 1.0 | | | | punpckhbw xmm1, xmm2 | 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | | pmaddwd xmm1, xmmword ptr [0x80492d0] | 3 | | 1.0 | | | | 2.0 | | | CP | phaddd xmm0, xmm1 | 3^ | 2.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0xee | 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0x55 | 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1 | 1 | 1.0 | | | | | | | | CP | movd eax, xmm0 | 1 | | | | | | | 1.0 | | CP | add eax, edx | 1 | | | | | | | 1.0 | | CP | xor eax, edx Total Num Of Uops: 51
The result of Intel-IACA Latency Analysis for Haswell 32-bit:
Latency Analysis Report --------------------------- Latency: 64 Cycles N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis The Resource delay is counted since all the sources of the instructions are ready and until the needed resource becomes available | Inst | Resource Delay In Cycles | | | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | | ------------------------------------------------------------------------- | 0 | | | | | | | | | | | xor eax, eax | 1 | | | | | | | | | | | xor ecx, ecx | 2 | | | | | | | | | | | xor edx, edx | 3 | | | | | | | | | | | dec eax | 4 | | | | | | | | | 1 | CP | mov bl, byte ptr [esi] | 5 | | | | | | | | | | CP | cmp bl, 0x2d | 6 | | | | | | | | | | CP | cmovz edx, eax | 7 | | | | | | | | | | CP | cmp bl, 0x2b | 8 | | | | | | | 1 | | | CP | cmovz ecx, eax | 9 | | | | | | | | | | CP | sub esi, edx | 10 | | | | | | | | | | CP | sub esi, ecx | 11 | | | | | | | | | 3 | | xor eax, eax | 12 | | | | | | | | | | CP | inc esi | 13 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x30 | 14 | | | | | | | | | | | jz 0xfffffffb | 15 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x0 | 16 | | | | | | | | | | | jz 0x8b | 17 | | | | | | | | | | CP | dec esi | 18 | | | | | | | | | 4 | | movdqa xmm0, xmmword ptr [0x80492f0] | 19 | | | | | | | | | | CP | movdqu xmm1, xmmword ptr [esi] | 20 | | | | | | | | | 5 | | pxor xmm2, xmm2 | 21 | | | | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14 | 22 | | | | | | | | | | | jo 0x6e | 23 | | | | | | | | | 6 | | mov al, 0x30 | 24 | | | | | | | | | | CP | sub ecx, 0x10 | 25 | | | | | | | | | | CP | movd xmm0, ecx | 26 | | | | | | | | | | CP | pshufb xmm0, xmm2 | 27 | | | | | | | | | 7 | CP | paddb xmm0, xmmword ptr [0x80492c0] | 28 | | | | | | | | | | CP | pshufb xmm1, xmm0 | 29 | | | | | | 1 | | | | | movd xmm0, eax | 30 | | | | | | 1 | | | | | pshufb xmm0, xmm2 | 31 | | | | | | | | | | CP | psubusb xmm1, xmm0 | 32 | | | | | | | | | | CP | movdqa xmm0, xmm1 | 33 | | | | | | | | | | CP | punpcklbw xmm0, xmm2 | 34 | | | | | | | | | | | punpckhbw xmm1, xmm2 | 35 | | | | | | | | | 9 | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 36 | | | | | | | | | 9 | | pmaddwd xmm1, xmmword ptr [0x80492d0] | 37 | | | | | | | | | | CP | phaddd xmm0, xmm1 | 38 | | | | | | | | | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 39 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0xee | 40 | | | | | | | | | | CP | paddd xmm0, xmm1 | 41 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0x55 | 42 | | | | | | | | | | CP | paddd xmm0, xmm1 | 43 | | | | | | | | | | CP | movd eax, xmm0 | 44 | | | | | | | | | | CP | add eax, edx | 45 | | | | | | | | | | CP | xor eax, edx Resource Conflict on Critical Paths: ----------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------- | Cycles | 0 0 | 0 | 0 0 | 0 0 | 0 | 0 | 1 | 0 | ----------------------------------------------------------------- List Of Delays On Critical Paths ------------------------------- 6 --> 8 1 Cycles Delay On Port6
An alternative handling suggested in comments by Peter Cordes is replacing the last two add+xor instructions by an imul. This concentration of OpCodes is likely to be superior. Unfortunately IACA doesn't support that instruction and throws a ! - instruction not supported, was not accounted in Analysis comment. Nevertheless, although I like the reduction of OpCodes and reduction from (2uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave it to the implementer which way to choose. I haven't checked if the following code is sufficient for parsing any number. It is just mentioned for completeness and code modifications in other parts may be necessary (especially handling positive numbers).

The alternative may be replacing the last two lines with:
... /* negate if negative number */ imul eax, edx FINISH: /* EAX is return (u)int value */

这篇关于如何使用SIMD实现atoi？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何使用SIMD实现atoi？ [英] How to implement atoi using SIMD?

问题描述

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

如何使用SIMD实现atoi？ [英] How to implement atoi using SIMD?

问题描述

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭