Coding for ARM NEON: How to start?

Problem description

Background (skip this section if you like)

Let me start by saying that I am no expert programmer. I am a young junior computer vision (CV) engineer, and I am fairly experienced in C++ programming mainly because of an extensive use of the great OpenCV2 C++ API. All I've learned was through the need to execute projects, the need to solve problems and meet deadlines, as it is the reality in the industry.

Recently, we started developing CV software for embedded systems (ARM boards), and we do it using plain C++ optimized code. However, it is a huge challenge to build a real-time CV system in this kind of architecture due to its limited resources when compared to traditional computers.

That's when I found out about NEON. I've read a bunch of articles about it, but it is a fairly recent topic, so there isn't much information available, and the more I read, the more confused I get.

Question

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in a C++ environment? The main source of my confusion is the fact that almost all code snippets I see are in Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point. I use the Eclipse IDE on Gentoo Linux to write C++ code.

Update

After reading the answers I did some tests with the software. I compiled my project with the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon 

Keep in mind that this project includes extensive libraries such as openFrameworks, OpenCV and OpenNI, and everything was compiled with these flags. To compile for the ARM board we use a Linaro toolchain cross-compiler, and GCC's version is 4.8.3. Would you expect this to improve the performance of the project? Because we experienced no changes at all, which is rather weird considering all the answers I read here.

Another question: all the for loops have an apparent number of iterations, but many of them iterate through custom data types (structs or classes). Can GCC optimize these loops even though they iterate through custom data types?

Recommended answer

From your update, you may misunderstand what the NEON processor does. It is a SIMD (Single Instruction, Multiple Data) vector processor. That means that it is very good at applying one instruction (say "multiply by 4") to several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like those things, the NEON processor is going to be a huge help.
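A minimal sketch of what those three operations look like as plain C++ loops, of the shape an auto-vectorizer can typically map onto SIMD instructions. The function names are illustrative and not from any library, and whether a given compiler actually vectorizes each loop depends on its version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Multiply by 4": the same operation applied to every element.
void scale_by_four(std::vector<float>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] *= 4.0f;
    }
}

// "Add each element of these two lists to create a third list."
std::vector<float> add_lists(const std::vector<float>& a,
                             const std::vector<float>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// "Add all these numbers together": a reduction over one array.
// Integers are used here because integer addition can be reordered freely.
int sum_all(const std::vector<int>& data) {
    int total = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        total += data[i];
    }
    return total;
}
```

One detail that may matter for the flags in the question: GCC's documentation for -mfpu notes that on 32-bit ARM the auto-vectorizer does not generate NEON floating-point instructions unless -funsafe-math-optimizations is also specified, because NEON does not fully implement IEEE 754 (denormals are treated as zero). Float loops like the first two may therefore stay scalar under -O3 -ftree-vectorize alone.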

To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a round trip to the NEON unit). Vector programming is a different way of thinking about your program. It's all about pipeline management.
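One common way to keep conditionals out of the math is to express them as min/max or select operations, so every element goes through the same instruction sequence. The sketch below assumes 8-bit pixel data; clamp_pixels and threshold_pixels are hypothetical helper names, not OpenCV functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp every pixel to [lo, hi] using min/max instead of if/else branches.
void clamp_pixels(std::vector<std::uint8_t>& pixels,
                  std::uint8_t lo, std::uint8_t hi) {
    assert(lo <= hi);
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        pixels[i] = std::min(std::max(pixels[i], lo), hi);
    }
}

// Binary threshold written as a per-element select: each output depends
// only on the matching input, so no result has to be inspected mid-loop.
void threshold_pixels(const std::vector<std::uint8_t>& in,
                      std::vector<std::uint8_t>& out,
                      std::uint8_t t) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(in[i] > t ? 255 : 0);
    }
}
```

Both loops have no early exit and no data-dependent control flow between iterations, which is the property that matters here.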

Now, for many very common kinds of problems, the compiler can work all of this out automatically. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's a SIMD processor for doing parallel math.
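To make the contiguity point concrete, here is a sketch contrasting an array of structs with a struct of arrays. The Point and Points types and the helper functions are invented for illustration; the idea is only that each field ends up in its own contiguous block, so a loop over one field reads consecutive memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs layout: in a std::vector<Point>, each x is separated
// from the next x by a y and an id, so the x values are not contiguous.
struct Point {
    float x;
    float y;
    int id;
};

// Struct-of-arrays layout: every field lives in its own contiguous block.
struct Points {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<int> id;
};

// One-time conversion from the interleaved layout to the contiguous one.
Points to_struct_of_arrays(const std::vector<Point>& in) {
    Points out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.id.reserve(in.size());
    for (const Point& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.id.push_back(p.id);
    }
    return out;
}

// The hot loop now walks two plain float arrays.
void translate(Points& p, float dx, float dy) {
    assert(p.x.size() == p.y.size());
    float* x = p.x.data();
    float* y = p.y.data();
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += dx;
        y[i] += dy;
    }
}
```

The conversion itself costs time and memory, so this layout pays off mainly when the data can be kept in this form across many operations.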

For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.
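As one small example of avoiding allocation in a hot path, the sketch below allocates a scratch buffer once and reuses it for every row it processes. RowSmoother is a hypothetical class written for this illustration, and the 3-tap average is just a stand-in for real per-frame work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smooths rows of a fixed width with a 3-tap average. The scratch buffer
// is allocated once in the constructor and reused on every call.
class RowSmoother {
public:
    explicit RowSmoother(std::size_t width) : scratch_(width) {}

    // Replaces row with its smoothed version; border elements are copied.
    void apply(std::vector<float>& row) {
        assert(row.size() == scratch_.size());
        const std::size_t n = row.size();
        if (n < 3) {
            return;  // nothing to average
        }
        scratch_[0] = row[0];
        scratch_[n - 1] = row[n - 1];
        for (std::size_t i = 1; i + 1 < n; ++i) {
            scratch_[i] = (row[i - 1] + row[i] + row[i + 1]) / 3.0f;
        }
        row.swap(scratch_);  // constant-time buffer exchange, no allocation
    }

private:
    std::vector<float> scratch_;
};
```

After the swap, the old row storage becomes the scratch buffer for the next call, so a long-running loop over frames performs no further heap allocation in this step.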

The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.

My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job at obvious cases, but does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)

That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.

If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), was enough to let me beat the compiler by 2-3x in my particular problems. On the other hand, it fell short enough that when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)

  • ARM Assembly language
  • A few things iOS developers ought to know about the ARM architecture (iPhone-focused, but the principles are the same for all uses.)
  • ARM NEON support in the ARM compiler
  • Coding for NEON

One side note, my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.
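For readers who have not seen them, NEON intrinsics are ordinary C/C++ functions declared in the arm_neon.h header, so they do answer the "set of functions usable from C++" part of the question even if they rarely beat the compiler. The sketch below adds two float arrays four lanes at a time. The function name add_arrays is illustrative, and the preprocessor guard with a scalar fallback is an assumption made here so the same file also builds on non-ARM machines.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Computes out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Vector body: process four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);  // load 4 floats from a
        const float32x4_t vb = vld1q_f32(b + i);  // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));    // add lanes and store
    }
#endif
    // Scalar tail (or the whole loop when NEON is not available).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

A loop this simple is exactly the kind the compiler tends to vectorize on its own, which is consistent with the point above that intrinsics alone seldom change much.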
