ARM vs Thumb performance on iPhone 3GS, non floating point code

Question

I was wondering if anyone had any hard numbers on ARM vs Thumb code performance on iPhone 3GS. Specifically for non-floating point (VFP or NEON) code - I'm aware of the issues with floating point performance in Thumb mode.

Is there a point where the extra code size of bigger ARM instructions becomes a performance hazard? In other words, if my executable code is relatively small compared to available memory, is there any measured performance difference to turning Thumb mode on?

The reason I ask is that while I can enable ARM for the NEON specific source files in Xcode using the "-marm" option, this breaks the Simulator build because GCC is building x86. I was wondering whether I should just turn off "compile as thumb" and be done with it.

Answer

I don't know about the iPhone, but a blanket statement that Thumb is slower than ARM is not correct at all. Given 32-bit wide, zero-wait-state memory, Thumb will be a little slower, numbers like 5% or 10%. Now if it is Thumb-2 that is a different story: it is said that Thumb-2 can run faster. I don't know what the iPhone has; my guess is that it is not Thumb-2.
If you are not running out of zero-wait-state 32-bit memory then your results will vary. One big thing is 32-bit wide memory. If you are running on a 16-bit wide bus, like the GameBoy Advance family, and there are some wait states on that memory or ROM, then Thumb can easily outrun ARM for performance even though it takes more Thumb instructions to perform the same task.

Test your code! It is not hard to invent a test that provides the results you are interested in, or not. It is as easy to show ARM blowing away Thumb as it is to show Thumb blowing away ARM. Who cares what the Dhrystones are; what matters is how fast it runs YOUR code TODAY.

What I have found over the years in testing code performance for ARM is that your code and your compiler are the big factors. So Thumb is a few percent slower in theory, because it uses a few percent more instructions to perform the same task. But did you know that your favorite compiler could be horrible, and that by simply switching compilers you could run several times faster (gcc falls into that category)? Or try the same compiler with different optimization options. Either way, you can overshadow the ARM/Thumb difference by being smart about using the tools. You probably know this, but you would be surprised how many people think that the one way they know how to compile code is the only way, and that the only way to get better performance is to throw more memory or other hardware at the problem.

If you are on the iPhone, I hear those folks are using LLVM? I like the LLVM concept in many ways and am eager to use it as my daily driver when it matures, but found it produced code that was 10-20% (or much more) slower for the particular task I was doing. I was in ARM mode, I did not try Thumb mode, and I had the L1 and L2 caches on. Had I tested without the caches to truly compare Thumb to ARM, I would probably see Thumb a few percent slower. But if you think about it (which I wasn't interested in at the time), you can cache twice as much Thumb code as ARM code, which MIGHT imply that even though there is a few percent more code overall for the task, by caching significantly more of it and reducing the average fetch time, Thumb can be noticeably faster. I may have to go try that.

If you are using LLVM, you have the additional problem of multiple places to perform optimizations. Going from C to bytecode you can optimize; you can then optimize the bytecode itself; you can then merge all of your bytecode and optimize that as a whole; and then when going from bytecode to assembler you can optimize again. If you had only 3 source files, and assumed there were only two optimization levels per opportunity (don't optimize or do optimize), with gcc you would have 8 combinations to test; with LLVM the number of experiments is almost an order of magnitude higher. More than you can really run, hundreds to thousands. For the one test I was running, NOT optimizing on the C-to-bytecode step, then NOT optimizing the bytecode files while separate, but optimizing after merging the bytecode files into one big(ger) one, then having llc optimize on the way to ARM, produced the best results.

Bottom line... test, test, test.

Edit:

I have been using the word bytecode; I think the correct term is bitcode in the LLVM world. The code in the .bc files is what I mean...

If you are going from C to ARM using LLVM, there is bitcode (bc) in the middle. There are command line options for optimizing on the C-to-bc step. Once in bc form you can optimize per file, bc to bc. If you choose, you can merge two or more bc files into bigger bc files, or just turn all the files into one big bc file. Then each of these combined files can also be optimized.

My theory, which only has a couple of test cases behind it so far, is that if you do not do any optimization until you have the entire program/project in one big bc file, the optimizer has the maximum amount of information with which to do its job. So that means go from C to bc with no optimization. Then merge all the bc files into one big bc file. Once you have the whole thing as one big bc file, let the optimizer perform its optimization step, maximizing the information and hopefully the quality of the optimization. Then go from the optimized bc file to ARM assembler. The default setting for llc is with optimization on; you do want to allow that optimization, as it is the only step that knows how to optimize for the target. The bc-to-bc optimizations are generic and not target-specific (AFAIK).
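The "optimize late" flow described above can be sketched as a command pipeline. This uses current clang/LLVM tool names as stand-ins for whatever toolchain you have; exact flags vary by LLVM version, and the source file names are hypothetical:

```shell
# 1. C -> bitcode, with no optimization yet
clang -O0 -emit-llvm -c a.c -o a.bc
clang -O0 -emit-llvm -c b.c -o b.bc
clang -O0 -emit-llvm -c c.c -o c.bc

# 2. Merge all the bitcode into one big(ger) file
llvm-link a.bc b.bc c.bc -o whole.bc

# 3. Optimize the whole program in one pass, bc to bc
opt -O2 whole.bc -o whole.opt.bc

# 4. Bitcode -> ARM assembler, with llc's target-aware optimization on
llc -O2 -march=arm whole.opt.bc -o whole.s
```

Each of the four steps is an independent optimization knob, which is exactly why the number of combinations to test explodes compared to a single-compiler gcc build.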

You still have to test, test, test. Go ahead and experiment with optimizations between the steps; see if it makes your program run faster or slower.
