Persistent memory cache policy to write and read

Problem description

Is anyone trying to use Intel Optane DC Memory (DCPMM) in App Direct Mode (that is, as non-volatile memory) and to write or read to/from it using Write Through (WT) or Un-Cacheable (UC) memory policies? The idea is to use regular memory as non-volatile (data is not lost in case of failure); having dirty cache lines is not ideal, since the cache is volatile. There are multiple links that show examples using Write Back (WB) or Write Combining (WC) with non-temporal access (NTA) instructions, and also using WB with CLFLUSHOPT or CLWB write instructions. Are there any important drawbacks, other than bandwidth (not writing an entire cache line to memory), when using WT/UC compared to WB/WC?
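For context, the WB/WC-plus-NT-store pattern those examples use generally looks something like the minimal sketch below. The helper name, the 64-byte alignment of the destination, and the length being a multiple of 64 are my assumptions, not details from the question; it needs AVX-512F.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: copy src into a persistent-memory mapping dst using
 * non-temporal (streaming) stores, so no dirty cache lines are left behind
 * that would later need CLWB/CLFLUSHOPT.
 * Assumes dst is 64-byte aligned and len is a multiple of 64. */
static void persist_copy_nt(void *dst, const void *src, size_t len)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t off = 0; off < len; off += 64) {
        __m512i v = _mm512_loadu_si512((const void *)(s + off));
        _mm512_stream_si512((__m512i *)(d + off), v);  /* full-line NT store */
    }
    _mm_sfence();  /* order the NT stores before any later "commit" store */
}
```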

Solution

(This is mostly speculation, I haven't done any performance testing with Optane DC PM, and only read about UC or WT for DRAM occasionally. But I think enough is known about how they work in general to say it's probably a bad idea for many workloads.)

Further reading about Optane DC PM DIMMs: https://thememoryguy.com/whats-inside-an-optane-dimm/ - they include a wear-leveling remapping layer like an SSD.

Also related: "When I test AEP memory, I found that flushing a cacheline repeatedly has a higher latency than flushing different cachelines. I want to know what caused this phenomenon. Is it wear leveling mechanism?" on Intel forums. That would indicate that repeated writes to the same cache line might be even worse than you might expect.


UC also implies strong ordering, which would hurt out-of-order (OoO) execution, I think. I think UC also stops you from using NT stores for full-line writes. It would also totally destroy read performance, so I don't think it's worth considering.

WT is maybe worth considering as an alternative to clwb (assuming it actually works with NV memory), but you'd still have to be careful about compile-time reordering of stores. _mm_clwb is presumably a compiler memory barrier that would prevent such problems.
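For comparison, here is a minimal sketch of the WB + clwb approach that WT would be replacing. The helper name and the flush-then-fence structure are assumptions on my part, not the answer's code; it needs CLWB support (e.g. compile with -mclwb).

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: make a range of ordinary WB stores to persistent
 * memory durable by writing back each dirty line with CLWB, then fencing. */
static void persist_range_clwb(void *pmem, const void *src, size_t len)
{
    memcpy(pmem, src, len);                   /* normal cacheable WB stores */

    uintptr_t line = (uintptr_t)pmem & ~(uintptr_t)63;
    uintptr_t end  = (uintptr_t)pmem + len;
    for (; line < end; line += 64)
        _mm_clwb((void *)line);               /* write back, keep line cached */

    _mm_sfence();  /* order the write-backs before any later "commit" store */
}
```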

In a store-heavy workload, you'd expect serious slowdowns in writes, though. Per-core memory bandwidth is very much limited by the number of outstanding requests. Making each request smaller (only 8 bytes or something instead of a whole line) doesn't make it appreciably faster. The vast majority of the time is spent getting the request through the memory hierarchy and waiting for the address lines to select the right place, not in the actual burst transfer over the memory bus. (This is pipelined, so with multiple full-line requests to the same DRAM page a memory controller can spend most of its time transferring data, not waiting, I think. Optane / 3DXPoint isn't as fast as DRAM, so there may be more waiting.)

So for example, storing contiguous int64_t or double would take 8 separate stores per 64-byte cache line, unless you (or the compiler) vectorizes. With WT instead of WB + clwb, I'd guess that would be about 8x slower. This is not based on any real performance details about Optane DC PM; I haven't seen memory latency / bandwidth numbers, and I haven't looked at WT performance. I have seen occasional papers that compare synthetic workloads with WT vs. WB caching on real Intel hardware on regular DDR DRAM, though. I think it's usable if multiple writes to the same cache line aren't typical for your code. (But normally that's something you want to do and optimize for, because WB caching makes it very cheap.)
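As a concrete illustration of that point, a plain fill loop like the one below issues eight separate 8-byte stores per 64-byte line unless the compiler vectorizes it; under WT each of those would be its own write to memory, while under WB the line is written back once. The names are purely illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Eight separate 8-byte stores touch each 64-byte cache line,
 * unless the compiler vectorizes this loop. */
void fill_scalar(int64_t *dst, size_t n, int64_t value)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}
```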

If you have AVX512, that lets you do full-line 64-byte stores, as long as you make sure they're aligned. (Which you generally want for performance with 512-bit vectors anyway.)
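A sketch of the same kind of loop with full-line 64-byte AVX-512 stores, assuming dst is 64-byte aligned and n is a multiple of 8 (again, my illustration, not code from the answer):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* One aligned 64-byte store per cache line.  On WB memory you would still
 * need something like _mm_clwb per line + _mm_sfence (as sketched above)
 * to make the data durable. */
void fill_avx512(int64_t *dst, size_t n, int64_t value)
{
    __m512i v = _mm512_set1_epi64(value);
    for (size_t i = 0; i < n; i += 8)
        _mm512_store_si512((__m512i *)&dst[i], v);
}
```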
