How to explicitly load a structure into L1d cache?

Question

My goal is to load a static structure into the L1d cache, then perform some operations using that structure's members, and once the operations are done run invd to discard all the modified cache lines. Basically I want to create a secure environment inside the cache, so that while I perform operations inside the cache, the data is never leaked into RAM.

To do this, I have a kernel module in which I place some fixed values into the members of a structure. Then I disable preemption, disable the cache on all other CPUs (except the current one), disable interrupts, and use __builtin_prefetch() to load my static structure into the cache. After that, I overwrite the previously placed fixed values with new values. Then I execute invd (to discard the modified cache lines), re-enable the cache on all other CPUs, re-enable interrupts, and re-enable preemption. My rationale is that, since I'm doing this while in atomic mode, INVD will remove all the changes, and after coming back from atomic mode I should see the original fixed values that I placed previously. That is however not what happens: after coming out of atomic mode, I see the values that I used to overwrite the previously placed fixed values. Here is my module code.

Strangely, after rebooting the PC my output changes, and I just don't understand why: now I'm not seeing any changes at all. I'm posting the full code, including some fixes @Peter Cordes suggested:

#include <linux/module.h>    
#include <linux/kernel.h>    
#include <linux/init.h>      
#include <linux/moduleparam.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Author");
MODULE_DESCRIPTION("test INVD");

static struct CACHE_ENV{
    unsigned char in[128];
    unsigned char out[128];
}cacheEnv __attribute__((aligned(64)));

#define cacheEnvSize (sizeof(cacheEnv)/64)
//#define change "Hello"
unsigned char change[]="hello";


void disCache(void *p){
    __asm__ __volatile__ (
        "wbinvd\n"
        "mov %%cr0, %%rax\n\t"
        "or $(1<<30), %%eax\n\t"
        "mov %%rax, %%cr0\n\t"
        "wbinvd\n"
        ::
        :"%rax"
    );

    printk(KERN_INFO "cpuid %d --> cache disable\n", smp_processor_id());

}


void enaCache(void *p){
    __asm__ __volatile__ (
        "mov %%cr0, %%rax\n\t"
        "and $~(1<<30), %%eax\n\t"
        "mov %%rax, %%cr0\n\t"
        ::
        :"%rax"
    );

    printk(KERN_INFO "cpuid %d --> cache enable\n", smp_processor_id());

}

int changeFixedValue (struct CACHE_ENV *env){
    int ret=1;
    //memcpy(env->in, change, sizeof (change));
    //memcpy(env->out, change,sizeof (change));

    strcpy((char *)env->in, (const char *)change);
    strcpy((char *)env->out, (const char *)change);
    return ret;
}

void fillCache(unsigned char *p, int num){
    int i;
    //unsigned char *buf = p;
    volatile unsigned char *buf=p;

    for(i=0;i<num;++i){
    
/*
        asm volatile(
        "movq $0,(%0)\n"
        :
        :"r"(buf)
        :
        );
*/
        //__builtin_prefetch(buf,1,1);
        //__builtin_prefetch(buf,0,3);
        *buf += 0;
        buf += 64;   
     }
    printk(KERN_INFO "Inside fillCache, num is %d\n", num);
}

static int __init device_init(void){
    unsigned long flags;
    int result;

    struct CACHE_ENV env;

    //setup Fixed values
    char word[] ="0xabcd";
    memcpy(env.in, word, sizeof(word) );
    memcpy(env.out, word, sizeof (word));
    printk(KERN_INFO "env.in fixed is %s\n", env.in);
    printk(KERN_INFO "env.out fixed is %s\n", env.out);

    printk(KERN_INFO "Current CPU %d\n", smp_processor_id());

    // start atomic
    preempt_disable();
    smp_call_function(disCache,NULL,1);
    local_irq_save(flags);

    asm("lfence; mfence" ::: "memory");
    fillCache((unsigned char *)&env, cacheEnvSize);
    
    result=changeFixedValue(&env);

    //asm volatile("invd\n":::);
    asm volatile("invd\n":::"memory");

    // exit atomic
    smp_call_function(enaCache,NULL,1);
    local_irq_restore(flags);
    preempt_enable();

    printk(KERN_INFO "After: env.in is %s\n", env.in);
    printk(KERN_INFO "After: env.out is %s\n", env.out);

    return 0;
}

static void __exit device_cleanup(void){
    printk(KERN_ALERT "Removing invd_driver.\n");
}

module_init(device_init);
module_exit(device_cleanup);

I'm getting the following output:

[ 3306.345292] env.in fixed is 0xabcd
[ 3306.345321] env.out fixed is 0xabcd
[ 3306.345322] Current CPU (null)
[ 3306.346390] cpuid 1 --> cache disable
[ 3306.346611] cpuid 3 --> cache disable
[ 3306.346844] cpuid 2 --> cache disable
[ 3306.347065] cpuid 0 --> cache disable
[ 3306.347313] cpuid 4 --> cache disable
[ 3306.347522] cpuid 5 --> cache disable
[ 3306.347755] cpuid 6 --> cache disable
[ 3306.351235] Inside fillCache, num is 4
[ 3306.352250] cpuid 3 --> cache enable
[ 3306.352997] cpuid 5 --> cache enable
[ 3306.353197] cpuid 4 --> cache enable
[ 3306.353220] cpuid 6 --> cache enable
[ 3306.353221] cpuid 2 --> cache enable
[ 3306.353221] cpuid 1 --> cache enable
[ 3306.353541] cpuid 0 --> cache enable
[ 3306.353608] After: env.in is hello
[ 3306.353609] After: env.out is hello

My Makefile:

obj-m += invdMod.o
CFLAGS_invdMod.o := -O0
invdMod-objs := disable_cache.o  

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
    rm -f *.o

Any thoughts on what I'm doing incorrectly? As I said before, I expect my output to remain unchanged.

One reason I can think of is that __builtin_prefetch() is not actually putting the structure into the cache. Another way to get something into the cache is to set up a write-back region with the help of MTRR & PAT; however, I'm fairly clueless about how to achieve that. I found that "12.6. Creating MTRRs from a C program using ioctl()'s" shows how to create an MTRR region, but I can't figure out how to bind the address of my structure to that region.

My CPU is: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

Kernel version: Linux xxx 4.4.0-200-generic #232-Ubuntu SMP Wed Jan 13 10:18:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

GCC version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

I compiled this module with the -O0 flag.

Update 2: turning off hyperthreading

I turned off hyperthreading with echo off > /sys/devices/system/cpu/smt/control. After that, running my module makes it look as if changeFixedValue() & fillCache() are not getting called.

Output:

[ 3971.480133] env.in fixed is 0xabcd
[ 3971.480134] env.out fixed is 0xabcd
[ 3971.480135] Current CPU 3
[ 3971.480739] cpuid 2 --> cache disable
[ 3971.480956] cpuid 1 --> cache disable
[ 3971.481175] cpuid 0 --> cache disable
[ 3971.482771] cpuid 2 --> cache enable
[ 3971.482774] cpuid 0 --> cache enable
[ 3971.483043] cpuid 1 --> cache enable
[ 3971.483065] After: env.in is 0xabcd
[ 3971.483066] After: env.out is 0xabcd

Answer

It looks very unsafe to call printk at the bottom of fillCache. You're about to run a few more stores and then an invd, so any modifications printk makes to kernel data structures (like the log buffer) might get written back to DRAM, or might get invalidated if they're still dirty in cache. If some but not all stores make it to DRAM (because of limited cache capacity), you could leave kernel data structures in an inconsistent state.

I'd guess that your current tests with HT disabled show everything working even better than you hoped, including discarding stores done by printk, as well as discarding the stores done by changeFixedValue. That would explain the lack of log messages left for user-space to read once your code finishes.

To test this, you'd ideally want to clflush everything printk did, but there's no easy way to do that. Perhaps wbinvd then changeFixedValue then invd. (You're not entering no-fill mode on this core, so fillCache isn't necessary for your store / invd idea to work, see below.)

CR0.CD is per-physical-core, so having your HT sibling core disable cache also means CD=1 for the isolated core. So with HT enabled, you were in no-fill mode even on the isolated core.

With HT disabled, the isolated core is still normal.

asm volatile("invd\n":::); without a "memory" clobber tells the compiler it's allowed to reorder it wrt. memory operations. Apparently that isn't the problem in your case, but it's a bug you should fix.

asm("mfence; lfence" ::: "memory"); 放在 fillCache 之前可能也是一个好主意,以确保任何缓存-miss 加载和存储不在运行中,并且可能在您的代码运行时分配新的缓存行.或者甚至可能是一个完全序列化的指令,如 asm("xor %eax,%eax; cpuid" ::: "eax", "ebx", "ecx", "edx", "memory";);,但我不知道 CPUID 阻止了哪个 mfence;围栏不会.

Probably also a good idea to put asm("mfence; lfence" ::: "memory"); right before fillCache, to make sure any cache-miss loads and stores aren't still in flight and maybe allocating new cache lines while your code is running. Or possibly even a fully serializing instruction like asm("xor %eax,%eax; cpuid" ::: "eax", "ebx", "ecx", "edx", "memory");, but I don't know of anything that CPUID blocks which mfence; lfence wouldn't.

PREFETCHT0 (into L1d cache) is __builtin_prefetch(p,0,3);. This answer shows how the args map to instructions; you're using prefetchw (write-intent), or I think prefetcht1 (L2 cache), depending on compiler options.

But really since you need this for correctness, you shouldn't be using optional hints that the HW can drop if it's busy. mfence; lfence would make it unlikely for the HW to actually be busy, but still not a bad idea.

Use a volatile read like READ_ONCE to get GCC to emit a load instruction. Or use volatile char *buf with *buf |= 0; or something to truly RMW instead of prefetch, to make sure the line is exclusively owned without having to get GCC to emit prefetchw.

Perhaps worth running fillCache a couple times, just to make more sure that every line is properly in the state you want. But since your env is smaller than 4k, each line will be in a different set in L1d cache, so there's no risk that one line got tossed out while allocating another (except in case of an alias in L3 cache's hash function? But even then, pseudo-LRU eviction should keep the most-recent line reliably.)

static struct CACHE_ENV { ... } cacheEnv; isn't guaranteed to be aligned by the cache line size; you're missing C11 _Alignas(64) or GNU C __attribute__((aligned(64))). So it might be spanning more than sizeof(T)/64 lines. Or for good measure, align by 128 for the L2 adjacent-line prefetcher. (Here you can and should simply align your buffer, but The right way to use function _mm_clflush to flush a large struct shows how to loop over every cache line of an arbitrary-sized possibly-unaligned struct.)

This doesn't explain your problem, since the only part that might get missed is the last up-to-48 bytes of env.out. (I think the global struct will get aligned by 16 by default ABI rules.) And you're only printing the first few bytes of each array.

And BTW, overwriting your buffer with 0 via memset after you're done should also keep your data from getting written back to DRAM about as reliably as INVD, but faster. (Maybe a manual rep stosb via asm to make sure it can't optimize away as a dead store).

No-fill mode might also be useful here to stop cache misses from evicting existing lines. AFAIK, that basically locks down the cache so no new allocations will happen, and thus no evictions. (But you might not be able to read or write other normal memory, although you could leave a result in registers.)

No-fill mode (for the current core) would make it definitely safe to clear your buffers with memset before re-enabling allocation; no risk of a cache miss during that causing an eviction. Although if your fillCache actually works properly and gets all your lines into MESI Modified state before you do your work, your loads and stores will hit in L1d cache without risk of evicting any of your buffer lines.

If you're worried about DRAM contents (rather than bus signals), then clflushopt each line after memset will reduce the window of vulnerability. (Or memcpy from a clean copy of the original if 0 doesn't work for you, but hopefully you can just work in a private copy and leave the orig unmodified. A stray write-back is always possible with your current method so I wouldn't want to rely on it to definitely always leave a large buffer unmodified.)

Don't use NT stores for a manual memset or memcpy: that might flush the "secret" dirty data before the NT store. One option would be to memset(0) with normal stores or rep stosb, then loop again with NT stores. Or perhaps doing 8x movq normal stores per line, then 8x movnti, so you do both things to the same line back to back before moving on.

If you're not using no-fill mode, it shouldn't even matter whether the lines are cached before you write to them. You just need your writes to be dirty in cache when invd runs, which should be true even if they got that way from your stores missing in cache.

fillCachechangeFixedValue 之间已经没有像 mfence 这样的障碍,这很好,但意味着任何缓存未命中都在启动缓存时仍在运行你弄脏了它.

You already don't have any barrier like mfence between fillCache and changeFixedValue, which is fine but means that any cache misses from priming the cache are still in flight when you dirty it.

INVD itself is serializing, so it should wait for stores to leave the store buffer before discarding cache contents. (So putting mfence;lfence after your work, before INVD, shouldn't make any difference.) In other words, INVD should discard cacheable stores that are still in the store buffer, as well as dirty cache lines, unless committing some of those stores happens to evict anything.
