Fastest x86 assembly code to synchronize access to an array?


Question

What is the fastest x86 assembly code to synchronize access to an array in memory?

To be more precise: we have a malloc'ed, contiguous, single-page region in memory, and the OS will not page out this region for the duration of our experiment. One thread will write to the array, and one thread will read from it. The array is small, but larger than the atomic-write capability of your CPU (so a separate lock is actually required).

"Fastest" means effective speed: do not just assume that the length of the machine code is what matters, but take into account the caching behavior of the lock and the branching behavior with respect to the surrounding code.

It has to work on x86-32 and/or x86-64.

It has to work on top of (or on descendants of) Windows since XP, Linux since kernel 2.2, or Mac OS X (in user mode).

Please, no "it depends" responses: if it depends on anything I have not specified here, just make up your own example(s) and state what is fastest in that case (or those cases).

Post code! (This is to prevent vague descriptions.)

Post not only your 2-line LOCK + CMPXCHG compare-and-swap, but show us how you integrate it with the read instructions in one thread and the write instructions in the other.

If you like, explain your tweaks for cache optimality and how to avoid branch mispredictions if the branch target depends on (1) whether you get the lock or not, or (2) what the first byte of a larger read is.

If you like, distinguish between multiprocessing and task switching: how will your code perform if the threads do not run on two CPUs but have to share a single one?

Recommended answer

Really, the answer is "it depends". What's the usage pattern of your array? Is it read-mostly? Is it update-mostly, and can you get away with imprecise results on reading (using per-CPU arrays)? Are updates so infrequent that RCU would give serious performance improvements?

There are lots of tradeoffs here; see Paul McKenney's book, "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
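
To make one of those tradeoffs concrete, here is a minimal sketch of a sequence lock (seqlock) for the read-mostly, one-writer, one-reader case. It is not taken from the answer above and is only one possible approach. It is written in C11 rather than assembly, and the names `seq_array`, `seq_array_write`, `seq_array_try_read`, and the size `ARRAY_WORDS` are hypothetical. On x86 the acquire/release orderings used here typically compile to plain MOV instructions with no LOCK prefix, but that depends on the compiler and should be checked in the generated code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_WORDS 8  /* hypothetical array size, in 32-bit words */

typedef struct {
    atomic_uint seq;                     /* even: stable, odd: write in progress */
    _Atomic unsigned data[ARRAY_WORDS];  /* payload shared between the threads */
} seq_array;

/* Writer side. Assumes exactly ONE writer thread, so no CAS is needed. */
static void seq_array_write(seq_array *a, const unsigned *src)
{
    unsigned s = atomic_load_explicit(&a->seq, memory_order_relaxed);

    /* Mark the array as "being written" (odd sequence number). */
    atomic_store_explicit(&a->seq, s + 1, memory_order_relaxed);
    /* Keep the data stores below from being ordered before the mark. */
    atomic_thread_fence(memory_order_release);

    for (size_t i = 0; i < ARRAY_WORDS; i++)
        atomic_store_explicit(&a->data[i], src[i], memory_order_relaxed);

    /* Publish: the sequence number is even again and all data is visible. */
    atomic_store_explicit(&a->seq, s + 2, memory_order_release);
}

/* Reader side. Returns true if dst holds a consistent snapshot,
 * false if a write was in progress or intervened (caller should retry). */
static bool seq_array_try_read(seq_array *a, unsigned *dst)
{
    unsigned s0 = atomic_load_explicit(&a->seq, memory_order_acquire);
    if (s0 & 1u)
        return false;  /* writer is in the middle of an update */

    for (size_t i = 0; i < ARRAY_WORDS; i++)
        dst[i] = atomic_load_explicit(&a->data[i], memory_order_relaxed);

    /* Keep the data loads above from being ordered after the re-check. */
    atomic_thread_fence(memory_order_acquire);
    unsigned s1 = atomic_load_explicit(&a->seq, memory_order_relaxed);

    return s0 == s1;   /* unchanged sequence number: snapshot is consistent */
}
```

A reader would normally call `seq_array_try_read` in a loop until it returns true. The reader never writes to shared memory, so the cache line holding the array does not bounce between cores on reads; the price is that a reader can be forced to retry repeatedly if the writer updates very often, which is why this only suits the read-mostly pattern.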
