How can I write code to hint to the JVM to use vector operations?

Question

Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Preface: I am trying to do this in pure Java (no JNI to C++, no GPGPU work, etc.). I have profiled, and the bulk of the processing time comes from the math operations in this method (probably 95% floating-point math and 5% integer math). I've already reduced all Math.xxx() calls to approximations that are good enough, so most of the math is now floating-point multiplies with a few adds.

I have some code that deals with audio processing. I've been making tweaks and have already seen great gains. Now I'm looking into manual loop unrolling to see if there's any benefit (with a manual unroll of 2, I am seeing approximately a 25% improvement). While trying a manual unroll of 4 (which is getting very complicated, since I am unrolling both loops of a nested loop), I started wondering whether there is anything I can do to hint to the JVM that it can use vector operations (e.g. SSE2, AVX) at runtime. Each audio sample can be calculated completely independently of the other samples, which is why I've already been able to see a 25% improvement (by reducing the dependencies between floating-point calculations).
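For readers who want something concrete, here is a minimal sketch of the kind of unrolling described above. The kernel (a weighted sum over a sliding window), the class name `UnrollSketch`, and the method names are assumptions made for illustration; the actual audio code is not shown in the question. Only the outer loop is unrolled, by 2, so two independent samples are accumulated per pass.

```java
// Hypothetical sketch: the kernel is a stand-in, not the asker's code.
final class UnrollSketch {

    // Plain nested loop: one accumulator, one output sample per outer pass.
    static float[] simple(float[] in, float[] coef) {
        int n = Math.max(0, in.length - coef.length + 1);
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            float acc = 0f;
            for (int k = 0; k < coef.length; k++) {
                acc += in[i + k] * coef[k];
            }
            out[i] = acc;
        }
        return out;
    }

    // Outer loop unrolled by 2: two accumulators that do not depend on each other.
    static float[] unrolled2(float[] in, float[] coef) {
        int n = Math.max(0, in.length - coef.length + 1);
        float[] out = new float[n];
        int i = 0;
        for (; i + 1 < n; i += 2) {
            float acc0 = 0f;
            float acc1 = 0f;
            for (int k = 0; k < coef.length; k++) {
                float c = coef[k];
                acc0 += in[i + k] * c;
                acc1 += in[i + 1 + k] * c;
            }
            out[i] = acc0;
            out[i + 1] = acc1;
        }
        // Tail: at most one leftover sample when n is odd.
        for (; i < n; i++) {
            float acc = 0f;
            for (int k = 0; k < coef.length; k++) {
                acc += in[i + k] * coef[k];
            }
            out[i] = acc;
        }
        return out;
    }
}
```

Both methods add the terms for each output sample in the same order, so they should return identical arrays. Whether the unrolled form is faster, and whether the JIT turns either form into vector instructions, depends on the JVM and the hardware and has to be measured.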

For example, I have 4 floats, one for each of the 4 unrolled iterations, to hold a partially computed value. Does it matter how I declare and use these floats? If I make them a float[4], does that hint to the JVM that they are unrelated to each other, compared with having float, float, float, float or even a class with 4 public floats? Is there something I could do unintentionally that will kill my chances of the code being vectorized?
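To make the two declaration styles concrete, here is a rough sketch of an unroll by 4 written both ways: once with four separate local floats and once with a float[4] scratch array. The kernel, the class name `AccumulatorStyles`, and the helper `tail` are hypothetical and exist only for illustration. The sketch shows that the two forms compute the same values; it does not settle which one a given JIT handles better.

```java
import java.util.Arrays;

// Hypothetical sketch: same stand-in kernel, two ways of holding partial values.
final class AccumulatorStyles {

    // Style 1: four separate local floats.
    static float[] withLocals(float[] in, float[] coef) {
        int n = Math.max(0, in.length - coef.length + 1);
        float[] out = new float[n];
        int i = 0;
        for (; i + 3 < n; i += 4) {
            float a0 = 0f, a1 = 0f, a2 = 0f, a3 = 0f;
            for (int k = 0; k < coef.length; k++) {
                float c = coef[k];
                a0 += in[i + k] * c;
                a1 += in[i + 1 + k] * c;
                a2 += in[i + 2 + k] * c;
                a3 += in[i + 3 + k] * c;
            }
            out[i] = a0;
            out[i + 1] = a1;
            out[i + 2] = a2;
            out[i + 3] = a3;
        }
        tail(in, coef, out, i);
        return out;
    }

    // Style 2: a float[4] scratch array, reused for every group of four samples.
    static float[] withArray(float[] in, float[] coef) {
        int n = Math.max(0, in.length - coef.length + 1);
        float[] out = new float[n];
        float[] acc = new float[4];
        int i = 0;
        for (; i + 3 < n; i += 4) {
            Arrays.fill(acc, 0f);
            for (int k = 0; k < coef.length; k++) {
                float c = coef[k];
                for (int j = 0; j < 4; j++) {
                    acc[j] += in[i + j + k] * c;
                }
            }
            System.arraycopy(acc, 0, out, i, 4);
        }
        tail(in, coef, out, i);
        return out;
    }

    // Handles the up to three samples left over after the groups of four.
    private static void tail(float[] in, float[] coef, float[] out, int from) {
        for (int i = from; i < out.length; i++) {
            float acc = 0f;
            for (int k = 0; k < coef.length; k++) {
                acc += in[i + k] * coef[k];
            }
            out[i] = acc;
        }
    }
}
```

Since the results match, the choice between the two is purely a performance question, and the only reliable way to answer it for a particular JVM is to benchmark both.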

I've come across articles online about writing code "normally", because the compiler/JVM knows the common patterns and how to optimize them, and deviating from those patterns can mean less optimization. In this case, however, I wouldn't have expected unrolling the loops by 2 to improve performance by as much as it did, so I'm wondering if there's anything else I can do (or at least avoid doing) to help my chances. I know the compiler/JVM is only going to get better, so I also want to be wary of doing things that will hurt me in the future.

Edit for the curious: unrolling by 4 increased performance by another ~25% over unrolling by 2, so I really think vector operations would help in my case if the JVM supported them (or perhaps it is already using them).

Thanks!

Answer

How can I ... audio processing ... pure java (no JNI to C++, no GPGPU work, etc...) ... use vector operations (e.g. SSE2, AVX, etc...)

Java is a high-level language (one instruction in Java generates many hardware instructions) that is, by design (e.g. garbage-collected memory management), not suited to tasks that manipulate high data volumes in real time.

There are usually special pieces of hardware optimized for a particular role (e.g. image processing or speech recognition), which often achieve parallelism through several simplified processing pipelines.

There are also special programming languages for this sort of task, mainly hardware description languages and assembly language.

Even C++ (considered the fast language) will not automagically use super-optimized hardware operations for you. It may just inline one of several hand-crafted assembly language methods at certain places.

So my answer is that there is "probably no way" to instruct the JVM to use a hardware optimization (e.g. SSE) for your code, and even if there were one, the Java runtime would still have too many other factors slowing your code down.

Use a low-level language designed for this task and link it to Java for the high-level logic.

Edit: adding more information based on the comments.

If you are convinced that a high-level "write once, run anywhere" language runtime definitely should also do lots of low-level optimizations for you and automagically turn your high-level code into optimized low-level code, then... the way the JIT compiler optimizes depends on the implementation of the Java Virtual Machine. There are many of them.

In the case of the Oracle JVM (HotSpot), you can start looking for your answer by downloading the source code. The text SSE2 appears in the following files:


  • openjdk/hotspot/src/cpu/x86/vm/assembler_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/assembler_x86.hpp
  • openjdk/hotspot/src/cpu/x86/vm/c1_LIRGenerator_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/c1_Runtime1_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/sharedRuntime_x86_32.cpp
  • openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.hpp
  • openjdk/hotspot/src/cpu/x86/vm/x86_32.ad
  • openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp
  • openjdk/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
  • openjdk/hotspot/src/share/vm/c1/c1_LinearScan.cpp
  • openjdk/hotspot/src/share/vm/runtime/globals.hpp

They're in C++ and assembly language so you will have to learn some low level languages to read them anyway.
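If reading the C++ sources is more than you want to take on, a lighter first step may be to ask the running JVM what it reports. The sketch below assumes a HotSpot-based JVM that exposes `com.sun.management.HotSpotDiagnosticMXBean`, and it assumes the flag name `UseSSE`, which is an x86-specific HotSpot option and may not exist on other platforms or VMs; the class name `VmFlagProbe` is made up for illustration. It only reads a flag value and says nothing about whether a particular loop gets vectorized.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reads one HotSpot flag from the running JVM, if possible.
final class VmFlagProbe {

    // Returns "<flag> = <value>", or a note when the flag cannot be read here.
    static String describe(String flagName) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return flagName + ": not available on this JVM";
            }
            return flagName + " = " + bean.getVMOption(flagName).getValue();
        } catch (IllegalArgumentException | UnsupportedOperationException e) {
            // Thrown when the flag does not exist or the bean is not supported.
            return flagName + ": not available on this JVM";
        }
    }

    public static void main(String[] args) {
        // "UseSSE" is an assumption; it is only defined on x86 HotSpot builds.
        System.out.println(describe("UseSSE"));
    }
}
```

The output differs from machine to machine, so treat it as a starting point for investigation rather than an answer.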

I would not hunt that deep even with a +500 bounty. IMHO the question is wrong because it is based on wrong assumptions.
