Coding for ARM NEON: How to start?

Question

Background (skip this if you like)

Let me start by saying that I am no expert programmer. I am a young junior computer vision (CV) engineer, and I am fairly experienced in C++ programming, mainly because of extensive use of the great OpenCV2 C++ API. Everything I've learned came from the need to execute projects, solve problems, and meet deadlines, as is the reality in the industry.

Recently, we started developing CV software for embedded systems (ARM boards), and we do it using plain, optimized C++ code. However, it is a huge challenge to build a real-time CV system on this kind of architecture because of its limited resources compared to traditional computers.

That's when I found out about NEON. I've read a bunch of articles about it, but it is a fairly recent topic, so there isn't much information about it, and the more I read, the more confused I get.

Problem

I'm looking to optimize C++ code (mainly some for loops) using NEON's ability to compute 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in a C++ environment? The main source of my confusion is that almost all the code snippets I see are in assembly, for which I have absolutely no background and which I can't possibly afford to learn at this point. I use the Eclipse IDE on Gentoo Linux to write C++ code.

Update

After reading the answers, I ran some tests with the software. I compiled my project with the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon 

Keep in mind that this project includes extensive libraries such as openFrameworks, OpenCV, and OpenNI, and everything was compiled with these flags. To compile for the ARM board we use a Linaro cross-compiler toolchain, and the GCC version is 4.8.3. Would you expect this to improve the performance of the project? We saw no change at all, which is rather weird considering all the answers I read here.

Another question: all the for loops have an apparent number of iterations, but many of them iterate through custom data types (structs or classes). Can GCC optimize these loops even though they iterate through custom data types?

Recommended answer

EDIT:

From your update, you may misunderstand what the NEON processor does. It is a SIMD (Single Instruction, Multiple Data) vector processor. That means it is very good at applying one instruction (say "multiply by 4") to several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like those things, the NEON processor is going to be a huge help.
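As a rough illustration, the three operations just described can be written as plain C++ loops of the following shape. This is a minimal sketch: the function names are made up for this example, and whether a particular compiler actually turns each loop into NEON instructions depends on the compiler version, the flags, and the element type.

```cpp
#include <cstddef>

// "Multiply by 4": the same operation applied to every element.
void scale_by_four(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * 4.0f;
    }
}

// "Add each element of these two lists to create a third list."
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// "Add all these numbers together": a reduction over one array.
// Integers are used here because integer addition can be reordered freely.
int sum_all(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Each loop has a simple counter, touches contiguous memory, and has no data-dependent control flow, which is the shape a vectorizer looks for.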

To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a round trip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.
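One way to read "avoid most conditionals" is to express a per-element decision as a value computation instead of as control flow. The sketch below uses hypothetical function names and is only one possible formulation; a compiler may or may not vectorize it for a given target.

```cpp
#include <cstddef>
#include <cstdint>

// Binary threshold written as a select: every element goes through the
// same computation, and the comparison only picks between two values.
void threshold_select(const std::uint8_t* in, std::uint8_t* out,
                      std::size_t n, std::uint8_t t) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (in[i] > t) ? 255 : 0;
    }
}

// Counting matches by adding the comparison result (0 or 1) rather than
// branching on it and incrementing inside an if.
std::size_t count_above(const std::uint8_t* in, std::size_t n, std::uint8_t t) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += (in[i] > t) ? 1u : 0u;
    }
    return count;
}
```

Neither loop exits early or skips work based on the data, so the result of one element never has to be inspected before the next one is processed.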

Now, for many very common kinds of problems, the compiler can work all of this out automatically. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's a SIMD processor for doing parallel math.
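To make the contiguity point concrete, here is a small sketch contrasting a field buried inside a struct with the same field stored as its own array. The types `PointAoS` and `PointsSoA` and both functions are hypothetical, invented for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Array of structs: each x is separated from the next x by y and label,
// so the x values are not contiguous in memory.
struct PointAoS {
    float x;
    float y;
    int label;
};

void shift_x_aos(std::vector<PointAoS>& pts, float dx) {
    for (std::size_t i = 0; i < pts.size(); ++i) {
        pts[i].x += dx;
    }
}

// Struct of arrays: all x values sit next to each other in one block.
struct PointsSoA {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<int> label;
};

void shift_x_soa(PointsSoA& pts, float dx) {
    float* x = pts.x.data();
    const std::size_t n = pts.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += dx;
    }
}
```

Both functions compute the same result; the second simply hands the compiler a plain contiguous array of floats, which is the layout described above.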

For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.

The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.

My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least as of a couple of years ago, though the compilers certainly continue to improve). Having excellent ARM optimization is not the same as having excellent NEON optimization. As @Mats notes, the compiler will generally do an excellent job in obvious cases, but it does not handle every case ideally, and it is certainly possible for even a lightly skilled developer to beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)

That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.

If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), was enough to let me beat the compiler by 2-3x in my particular problems. On the other hand, it was not enough to stop an experienced NEON developer, when I showed him my code, from looking at it for about three seconds and saying "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)

  • ARM Assembly language
  • A few things iOS developers ought to know about the ARM architecture (iPhone-focused, but the principles are the same for all uses.)
  • ARM NEON support in the ARM compiler
  • Coding for NEON

One side note: my experience with NEON intrinsics (http://stackoverflow.com/questions/9828567/arm-neon-intrinsics-vs-hand-assembly/9829272#9829272) is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.
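For readers who have not seen them, this is roughly what NEON intrinsics look like from C++. It is a minimal sketch, not a recommendation: the function name `add_arrays_neon` is made up, the intrinsics come from the standard `arm_neon.h` header, and the preprocessor guard (an assumption about how the code is built) lets the same file fall back to a scalar loop on non-NEON targets.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Element-wise sum of two float arrays, four lanes at a time where NEON
// is available, with a scalar loop for the tail (and for other targets).
void add_arrays_neon(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));  // add lanes and store 4 results
    }
#endif
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

This is exactly the kind of loop a vectorizing compiler can often produce on its own from the plain scalar version, which is the point made above about intrinsics rarely adding much.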
