Speed up x64 assembler ADD loop

Question

I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I need to add two long numbers.

Profiling shows that my code spends up to 25% of its time in the add() and sub() routines, so it's important that they are as fast as possible. But I don't see much potential for improvement yet. Maybe you can give me some help, advice, insight or ideas. I'll test them and get back to you.

So far my add routine does some setup and then uses an 8-times unrolled loop:

mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
adc rax, r10
mov QWORD PTR [rcx+r11*8-64], rax

7 more blocks with different offsets follow and then it loops.
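
For reference, the operation that the unrolled loop performs can be sketched in portable C. This is only an illustrative baseline, not the poster's routine: the function name add_words, the least-significant-word-first layout and the returned carry are assumptions. It can serve as a correctness check when experimenting with assembler variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: dst = a + b over n 64-bit words,
   least significant word first. Returns the final carry (0 or 1). */
uint64_t add_words(uint64_t *dst, const uint64_t *a,
                   const uint64_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;  /* wraps only if a[i] is all ones and carry is 1 */
        uint64_t c1 = s < carry;    /* carry out of the first addition */
        s += b[i];
        uint64_t c2 = s < b[i];     /* carry out of the second addition */
        dst[i] = s;
        carry = c1 | c2;            /* at most one of the two can be set */
    }
    return carry;
}
```

A compiler will not necessarily turn this into an ADC chain, so it is meant for checking results rather than for timing comparisons.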

I tried loading the values from memory earlier, but that didn't help. I guess that is because of good prefetching. I use an Intel i7-3770 Ivy Bridge 4-core CPU. But I'd like to write code that works well on any modern CPU.

Edit: I did some timings: It adds 1k words in about 2.25 cycles/word. If I remove the ADC, so only the MOVs remain, it still takes about 1.95 cycles/word. So the main bottleneck seems to be the memory access. A library memcpy() works in about 0.65 cycles/word, but has only one input, not two. Still, it's much faster because of its use of SSE registers, I guess.

Some questions:


  • Is it useful to use "load, load, add, store" structure or would a "load, add-to-memory" help? So far my tests didn't show any advantages.
  • As usual, there is no help from SSE(2,3,4) to be expected?
  • Does the addressing (scaled index plus base plus offset) impact badly? I could use ADD r11, 8 instead.
  • What about the loop unrolling? I read unrolling was bad for Sandy Bridge architecture (Agner Fog http://www.agner.org/optimize/). Is it to be preferred or avoided?
  • (Edit) Can I use SSE registers to load and store words in larger chunks from memory and efficiently exchange words with general purpose registers and SSE registers?

I highly appreciate any comments.

Recommended answer

I'm pretty sure memcpy is faster because it doesn't have a dependency on the data being fetched before it can perform the next operation.

If you can arrange your code so that it does something like this:

mov rax, QWORD PTR [rdx+r11*8-64]
mov rbx, QWORD PTR [rdx+r11*8-56]
mov r10, QWORD PTR [r8+r11*8-64]
mov r12, QWORD PTR [r8+r11*8-56]
adc rax, r10
adc rbx, r12
mov QWORD PTR [rcx+r11*8-64], rax
mov QWORD PTR [rcx+r11*8-56], rbx

I'm not 100% sure that the offset of -56 is right for your code, but the concept is "right".

I would also consider cache hits and cache collisions. E.g. if you have three blocks of data (which it would seem that you do), make sure they are NOT aligned to the same offset in the cache. A bad example would be allocating all your blocks at a multiple of the cache size, so that they all map to the same place in the cache. Over-allocate and make SURE that your different data blocks are offset by at least 512 bytes (so allocate 4K oversize, round the start address up to a 4K boundary, then add 512 to the second buffer and 1024 to the third buffer).
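
The staggering scheme could be sketched in C roughly as follows. The helper name alloc_staggered and its signature are made up for illustration. It also oversizes by 4K plus the stagger, slightly more than the 4K mentioned above, so that the round-up and the stagger always fit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE 4096u

/* Hypothetical helper: returns a pointer that sits `stagger` bytes past a
   4 KiB boundary. *raw receives the pointer that must be passed to free(). */
void *alloc_staggered(size_t bytes, size_t stagger, void **raw)
{
    assert(stagger < PAGE);
    /* Oversize by one page for the round-up plus the stagger itself. */
    *raw = malloc(bytes + PAGE + stagger);
    if (*raw == NULL)
        return NULL;
    /* Round the start address up to the next 4 KiB boundary. */
    uintptr_t base = ((uintptr_t)*raw + PAGE - 1) & ~(uintptr_t)(PAGE - 1);
    return (void *)(base + stagger);
}
```

The three buffers would then be requested with staggers of 0, 512 and 1024, and released through the raw pointers rather than the adjusted ones.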

If your data is large enough (bigger than the L2 cache), you may want to use MOVNT to fetch/store your data. That avoids reading into the cache. This is ONLY of benefit when you have very large data, where the next read would simply kick something else that you may find "useful" out of the cache, and you wouldn't get back to the data before it had been evicted anyway, so keeping the value in the cache doesn't actually help.

Edit: Using SSE or similar won't help, as covered here: Can long integer routines benefit from SSE?
