Assembler task: translations


Question

Why is .L21 (which is longer than .L27) faster than .L27?

Why does the -funroll-loops flag speed up loop1 but not loop2?

Recommended answer

Anyway, about your question: the first loop has no dependency between iterations, i.e. each iteration of the loop is independent and can be computed as soon as possible (in fact, all iterations except the last can simply be thrown away, because they don't affect the return value at all).

In the second loop, each iteration depends on the previous result, so the CPU has to hold back each imul until the previous result is ready. I guess imul on modern x86 still has a throughput of about 1.0, but its latency may be above 1.0, and I'm not sure what the dependency will do in practice; that depends entirely on your target CPU platform, which you didn't specify. (Somebody like Peter Cordes can surely answer this for particular modern Intel microarchitectures, or you can read Agner's tables yourself, but since you didn't specify a target architecture, I don't see the point in working out any particular real-world example; for me this general chit-chat level is enough.)
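To make the difference concrete, here is a minimal sketch of the two loop shapes. The functions independent_iters and dependent_chain are hypothetical stand-ins written for illustration, not the loop1 and loop2 from the question, whose source is not shown here.

```c
/* Hypothetical stand-in for a loop with independent iterations:
 * every pass overwrites y without reading it, so the multiplies
 * can overlap and only the last iteration affects the result. */
int independent_iters(const int *a, int x, int n) {
    int y = 0;
    for (int i = 0; i < n; i++) {
        y = a[i] * x;      /* does not depend on the previous y */
    }
    return y;
}

/* Hypothetical stand-in for a loop-carried dependency:
 * each multiply needs the y produced by the previous iteration,
 * so the multiplies form one serial chain. */
int dependent_chain(const int *a, int x, int n) {
    int y = 1;
    for (int i = 0; i < n; i++) {
        y = y * x + a[i];  /* reads the previous y */
    }
    return y;
}
```

In the first shape the CPU (or the compiler) is free to start the next multiply before the previous one finishes, or to drop all but the last one. In the second shape each multiply has to wait for the full latency of the one before it, however many execution units are available.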

For example, on an 80386 I guess the second loop would be faster, because it has fewer instructions and the 80386 was still quite "simple" inside, with imul taking several clocks in either case. On the latest Intel CPUs the dependency will probably skew things slightly in favour of the first loop, but not by much, as imul is reasonably fast today.

Anyway, this is a nice example of how sorting out your algorithm first, and then tuning that, gives you the biggest performance gain: the first loop is not a real loop, and writing it as a simple formula makes the code even faster.

Curiously enough, I tried it in the Godbolt compiler explorer to see what modern compilers do with it. gcc does something quite convoluted, reading through each array member, or whatever exactly that wall of instructions does (too lazy to check in a debugger), while clang sees through it and produces the simplified formula instead: https://godbolt.org/g/p2MGHs

P.S. The first loop can be simplified down to:

int loop1_fix(int *a, int x, int n) {
    if (0 < n) return a[n-1]*x*x*x*x;
    else return x*x*x;
}
