ARM vs Thumb performance on iPhone 3GS, non floating point code


Problem Description


I was wondering if anyone had any hard numbers on ARM vs Thumb code performance on iPhone 3GS. Specifically for non-floating point (VFP or NEON) code - I'm aware of the issues with floating point performance in Thumb mode.

Is there a point where the extra code size of bigger ARM instructions becomes a performance hazard? In other words, if my executable code is relatively small compared to available memory, is there any measured performance difference to turning Thumb mode on?

The reason I ask is that while I can enable ARM for the NEON specific source files in Xcode using the "-marm" option, this breaks the Simulator build because GCC is building x86. I was wondering whether I should just turn off "compile as thumb" and be done with it.
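
For illustration, a minimal sketch of the conflict, assuming Apple's gcc driver and a hypothetical neon_kernels.c:

    # -marm is an ARM-only flag, so applying it to every architecture
    # breaks the x86 Simulator compile of the same file.
    gcc -arch armv7 -marm -c neon_kernels.c -o neon_kernels.o   # device build: accepted
    gcc -arch i386  -marm -c neon_kernels.c -o neon_kernels.o   # Simulator build: rejected
    # One workaround is to pass -marm only when the target architecture is ARM.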

Solution

I don't know about the iPhone, but the blanket statement that Thumb is slower than ARM is not correct at all. Given 32-bit-wide, zero-wait-state memory, Thumb will be a little slower, with numbers like 5% or 10%. Thumb-2 is a different story; it is said that Thumb-2 can run faster. I don't know what the iPhone has; my guess is that it is not Thumb-2.

If you are not running out of zero-wait-state, 32-bit memory, then your results will vary. One big factor is the memory width. If you are running on a 16-bit-wide bus, as on the GameBoy Advance family, with some wait states on that memory or ROM, then Thumb can easily outrun ARM even though it takes more Thumb instructions to perform the same task.

Test your code! It is not hard to invent a test that produces whichever result you are looking for. It is as easy to show ARM blowing away Thumb as it is to show Thumb blowing away ARM. Who cares what the Dhrystone numbers are; what matters is how fast it runs YOUR code TODAY.
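
As a minimal sketch of what that testing can look like, assuming an ARM-targeting gcc and a benchmark.c you supply (both names are placeholders):

    # Build the same code both ways and time it on the target hardware.
    gcc -O2 -marm   benchmark.c -o bench_arm
    gcc -O2 -mthumb benchmark.c -o bench_thumb
    time ./bench_arm
    time ./bench_thumb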

What I have found over the years in testing ARM code performance is that your code and your compiler are the big factors. So Thumb is a few percent slower in theory, because it uses a few percent more instructions to perform the same task. But did you know that your favorite compiler could be horrible, and that by simply switching compilers you could run several times faster (gcc falls into that category)? Or that using the same compiler and mixing up the optimization options can do the same? Either way, you can overshadow the ARM/Thumb difference by being smart about using the tools. You probably know this, but you would be surprised how many people think that the one way they know to compile code is the only way, and that the only way to get better performance is to throw more memory or other hardware at the problem.
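
For example, a rough sketch of sweeping optimization options with one compiler (hypothetical source file; pick whatever flag combinations you care about):

    # Try several optimization settings on the same source and compare timings.
    for flags in -O0 -O1 -O2 -O3 -Os; do
        gcc $flags -mthumb benchmark.c -o bench
        echo "flags: $flags"
        time ./bench
    done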

If you are on the iPhone, I hear those folks are using LLVM? I like the llvm concept in many ways and am eager to use it as my daily driver when it matures, but I found it produced code that was 10-20% (or much more) slower for the particular task I was doing. I was in ARM mode, I did not try Thumb mode, and I had the L1 and L2 caches on. Had I tested without the caches, to truly compare Thumb to ARM, I would probably have seen Thumb a few percent slower. But if you think about it (which I wasn't interested in at the time), you can cache twice as much Thumb code as ARM code, which MIGHT mean that even though there is a few percent more code overall for the task, caching significantly more of it and reducing the average fetch time could make Thumb noticeably faster. I may have to go try that.
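
One cheap way to check the code-density half of that argument is to compare object sizes for the two encodings (file name hypothetical):

    # Compare text-segment sizes for ARM vs Thumb builds of the same code.
    gcc -O2 -marm   -c kernel.c -o kernel_arm.o
    gcc -O2 -mthumb -c kernel.c -o kernel_thumb.o
    size kernel_arm.o kernel_thumb.o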

If you are using llvm, you have the additional problem of multiple places to perform optimizations. Going from C to bytecode you can optimize; you can then optimize the bytecode itself; you can then merge all of your bytecode and optimize that as a whole; and then, when going from bytecode to assembler, you can optimize again. If you had only 3 source files, and assumed only two optimization levels per opportunity (don't optimize, or do optimize), with gcc you would have 8 combinations to test; with llvm, since each file can be optimized or not at several stages, the number of experiments is almost an order of magnitude higher. More than you can really run, hundreds to thousands. For the one test I was running, NOT optimizing on the C-to-bytecode step, then NOT optimizing the bytecode files while separate, but optimizing after merging the bytecode files into one big(ger) one, and then having llc optimize on the way to ARM, produced the best results.

Bottom line... test, test, test.

EDIT:

I have been using the word bytecode; I think the correct term is bitcode in the LLVM world. The code in the .bc files is what I mean...

If you are going from C to ARM using LLVM, there is bitcode (bc) in the middle. There are command-line options for optimizing on the C-to-bc step. Once in bc form, you can optimize per file, bc to bc. If you choose, you can merge two or more bc files into bigger bc files, or just turn all the files into one big bc file. Then each of these combined files can also be optimized.

My theory, which only has a couple of test cases behind it so far, is that if you do not do any optimization until you have the entire program/project in one big bc file, the optimizer has the maximum amount of information with which to do its job. So that means going from C to bc with no optimization. Then merge all the bc files into one big bc file. Once you have the whole thing as one big bc file, let the optimizer perform its optimization step, maximizing the information and, hopefully, the quality of the optimization. Then go from the optimized bc file to ARM assembler. The default setting for llc is with optimization on; you do want to allow that optimization, as it is the only step that knows how to optimize for the target. The bc-to-bc optimizations are generic and not target specific (AFAIK).
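
A minimal sketch of that flow, with hypothetical file names (exact tool names and flags vary between LLVM versions):

    clang -emit-llvm -O0 -c a.c -o a.bc    # C to bitcode, no optimization
    clang -emit-llvm -O0 -c b.c -o b.bc
    llvm-link a.bc b.bc -o whole.bc        # merge everything into one big bc file
    opt -O3 whole.bc -o whole.opt.bc       # optimize with whole-program visibility
    llc -O2 whole.opt.bc -o whole.s        # llc does the target-specific optimization to ARM assembler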

You still have to test, test, test. Go ahead and experiment with optimizations between the steps, and see whether it makes your program run faster or slower.

