memcpy moving 128 bits in Linux


Question

I'm writing a Linux device driver for a PCIe device. The driver performs several reads and writes to test throughput. When I use memcpy, the maximum payload of a TLP is 8 bytes (on 64-bit architectures). In my opinion, the only way to get a 16-byte payload is to use the SSE instruction set. I've already seen this, but the code doesn't compile (an AT&T/Intel syntax issue).

  • Is there a way to use that code inside Linux?
  • Does anyone know where I can find an implementation of memcpy that moves 128 bits at a time?

Recommended answer

First, you are probably using GCC as the compiler, and GCC uses the asm statement for inline assembly. With it, the assembler code has to be written as a string literal. That string is copied into the compiler's assembly output before it is passed to the assembler, so it should contain newline characters between instructions.

Second, you will probably have to use AT&T syntax for the assembler code.

Third, GCC uses extended asm to pass variables between the assembler code and C.
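The first three points can be put together in a small user-space sketch. It assumes an x86-64 target with SSE2 and falls back to memcpy elsewhere. The names copy16_sse and copy_blocks_sse are hypothetical, and this is not the code from the question. Inside the kernel the register precautions described further down still apply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy one 16-byte block with a single SSE2 load and a single store.
 * copy16_sse is a hypothetical helper name used only for illustration. */
static inline void copy16_sse(void *dst, const void *src)
{
#if defined(__x86_64__)
    /* One string literal, instructions separated by "\n\t".
     * AT&T syntax: source operand first, destination second.
     * Extended asm: %0 is dst and %1 is src (both input operands);
     * xmm0 and "memory" are clobbers so the compiler knows they change.
     * movdqu tolerates unaligned addresses. */
    __asm__ volatile(
        "movdqu (%1), %%xmm0\n\t"
        "movdqu %%xmm0, (%0)\n\t"
        :
        : "r"(dst), "r"(src)
        : "xmm0", "memory");
#else
    memcpy(dst, src, 16);   /* portable fallback for other targets */
#endif
}

/* Copy n bytes in 16-byte blocks; n must be a multiple of 16. */
static void copy_blocks_sse(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16)
        copy16_sse(d + i, s + i);
}
```

Whether a 16-byte store actually reaches the device as one 16-byte TLP depends on how the memory is mapped and on the hardware, so that would need to be checked on the bus rather than assumed.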

Fourth, you should probably avoid inline assembler where possible anyway, because the compiler cannot schedule instructions past an asm statement (at least this used to be true). Instead, you could use GCC extensions such as the vector_size attribute:

typedef float v4sf __attribute__((vector_size(16)));

void fubar( v4sf *p, v4sf* q )
{
  v4sf p0 = *p++;
  v4sf p1 = *p++;
  v4sf p2 = *p++;
  v4sf p3 = *p++;

  *q++ = p0;
  *q++ = p1;
  *q++ = p2;
  *q++ = p3;
}

This has the advantage that the compiler will produce code even if you compile for a processor that doesn't have the SSE (XMM) registers but perhaps has some other 128-bit registers (or has no vector registers at all).
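The same idea extends to a copy loop over an arbitrary buffer. The following is a minimal sketch, assuming GCC or Clang; the type v16u8 and the function copy_vec16 are illustrative names. The may_alias and aligned(1) attributes are added so the vector type can overlay any byte buffer without alignment or aliasing assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 16-byte vector of bytes. may_alias lets it overlay any buffer, and
 * aligned(1) tells the compiler not to assume 16-byte alignment. */
typedef unsigned char v16u8
    __attribute__((vector_size(16), may_alias, aligned(1)));

/* Copy n bytes, 16 at a time, then finish the tail with memcpy. */
static void copy_vec16(void *dst, const void *src, size_t n)
{
    v16u8 *d = dst;
    const v16u8 *s = src;
    size_t blocks = n / 16;

    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < blocks; i++)
        d[i] = s[i];    /* typically one 128-bit load and store with SSE2 */

    /* Tail bytes that do not fill a whole vector. */
    memcpy((unsigned char *)dst + blocks * 16,
           (const unsigned char *)src + blocks * 16, n % 16);
}
```

The compiler remains free to pick the instructions, so the generated assembly is worth inspecting if a specific access width matters.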

Fifth, you should check whether the provided memcpy really isn't fast enough. memcpy is often heavily optimized.

Sixth, you should take precautions if you use special registers in the Linux kernel, because some registers are not saved for kernel code during a context switch. The SSE registers are among them. On x86, the kernel provides kernel_fpu_begin() and kernel_fpu_end() to bracket code that uses these registers.

Seventh, since you are using this to test throughput, you should consider whether the processor is a significant bottleneck at all. Compare the time spent executing the code with the time spent on reads from and writes to RAM (do you hit or miss the cache?) or on reads from and writes to the peripheral.
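One rough way to make that comparison in user space is to time the copy itself. This is a sketch under stated assumptions: a POSIX system with clock_gettime, and a hypothetical helper named measure_copy_rate that is not part of any library. It measures RAM-to-RAM copies only, not accesses to the PCIe device.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Time `reps` copies of `n` bytes and return the rate in bytes per second.
 * Returns 0.0 if the elapsed time is too short to measure. */
static double measure_copy_rate(void *dst, const void *src, size_t n, int reps)
{
    struct timespec t0, t1;

    assert(n > 0 && reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, n);
        /* Compiler barrier so the repeated copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)n * reps / secs : 0.0;
}
```

Running it once with a buffer small enough to stay in cache and once with a buffer much larger than the cache gives a feel for how much of the time goes to memory access rather than to the copy instructions.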

Eighth, when moving data you should avoid copying big chunks from RAM to RAM, and if the transfer is to or from a peripheral with limited bandwidth you should definitely consider using DMA. Remember that if access time is what limits performance, the CPU will still be considered busy (although it can't run at 100% speed).
