Optimizing C


Problem description





Without resorting to asm chunks, I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compiler's hand in, if possible, a
compiler/underlying-processor independent way: although, to be fair, I
can't see this stuff running on anything other than x86, but who knows.

I found some ok info here:

http://www.eventhelix.com/RealtimeMa...AndCPPCode.htm
http://www.devx.com/amd/Article/21314

But any techniques for doing the standard bit fiddles without
incurring extra unnecessary CPU bandwidth would be nice. By standard
bit fiddles I mean: shift & rotate, anding, oring, shift until bit
found, etc. The operations will be occurring on sequences of
words, and the smallest word will be the standard CPU word size (32
bits). By sequences of words: I might, for example, be locating
the first non-zero bit from the left in a sequence of 64 words.

Any pointers appreciated.

Recommended answer








First, it's not at all clear what you're trying to speed
up, nor what you have measured and found to be too slow. A
"generic" optimization is often ill-directed; optimizing "on
general principles" is often ill-advised. If you have a specific
operation you'd like to speed up, it'd be a good idea to show
the actual code you're currently using, along with measurements
of its current speed and an estimate of the amount of speedup
you simply MUST achieve to make the code useful.

As to finding the first non-zero bit in an array of 64 ints,
the first thing to do is nail down what you mean by "first," lest
endianness issues pollute the solutions. Having done that, you
should also decide what form the answer should take -- it may be
simpler (or harder) to get a bit mask than a bit index, and you
may benefit by adjusting the rest of the code to use the more
easily obtained answer.
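
To make the "first" and "mask versus index" decisions concrete, here is a minimal sketch. It assumes the convention stated later in the thread (bit 31 of a 32-bit word is the leftmost position); the function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Assumed convention: index 0 is the most significant bit (bit 31) of a
   32-bit word, i.e. the "leftmost" position. index must be in 0..31. */
uint32_t mask_from_index(unsigned index)
{
    return UINT32_C(0x80000000) >> index;
}

/* Returns 1 when the lowest-addressed byte of a word holds its least
   significant bits (little-endian). On such a machine a byte-by-byte scan
   of memory does not visit the leftmost bits of each word first. */
int is_little_endian(void)
{
    const uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

Working on whole words, as in `mask_from_index`, sidesteps byte order entirely; it only matters once the same buffer is also examined byte by byte.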

If one-bits are expected to be fairly sparse, you'll probably
want to search through largish "chunks" to find one that isn't
all zeroes, then search within that chunk for the proper bit.
You could do the search an `int' at a time with an ordinary loop,
or you might use memchr() to search byte-by-byte. The memchr()
method might (or might not) take longer for the search, but note
that it leaves a smaller space for the bit-by-bit search to explore.
You'd need to implement both and measure.
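
A minimal sketch of the word-at-a-time variant of that chunk search, assuming the data is an array of 32-bit words; `first_nonzero_word` is a hypothetical name. Note that the standard memchr() looks for one specific byte value, so it cannot directly express "first byte that is not zero"; only the plain loop is shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first non-zero word in words[0..count-1],
   or count if every word is zero. */
size_t first_nonzero_word(const uint32_t *words, size_t count)
{
    size_t i = 0;
    while (i < count && words[i] == 0)
        i++;
    return i;
}
```

Whether this beats any byte-wise alternative on a given machine is exactly the kind of question that has to be settled by measurement.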

If you've used memchr() and have located a non-zero byte, it
might make sense to use a precalculated 256-element table to finish
the job without further "computation." Then again, it might not:
modern CPUs are much faster than memory, and might be able to
complete a multi-step computation in less time than a memory fetch.
More measurements are needed.
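
One possible shape for such a 256-element table, as a sketch only: each entry holds the position of the leftmost set bit of the byte used as its index. The table layout, the "count from the left" convention and the names are assumptions made for illustration, not something specified in the post.

```c
#include <stdint.h>

/* leading_bit[b] is the position of the leftmost set bit in byte b,
   counted from the left (0 means bit 7, 7 means bit 0).
   Entry 0 is a sentinel value of 8, since a zero byte has no set bit. */
static unsigned char leading_bit[256];

/* Fills the table once; call before the first lookup. */
void init_leading_bit_table(void)
{
    leading_bit[0] = 8;
    for (unsigned b = 1; b < 256; b++) {
        unsigned pos = 0;
        while (!(b & (0x80u >> pos)))
            pos++;
        leading_bit[b] = (unsigned char)pos;
    }
}

/* Table lookup replacing the per-bit loop for a single byte. */
unsigned leftmost_bit_in_byte(unsigned char byte)
{
    return leading_bit[byte];
}
```

The lookup trades a short loop for one memory access, which is why the post cautions that it may or may not be faster than computing the answer.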

If you've searched int-by-int you could dive directly into
the second search, or you could first use a couple of comparisons
to decide which of the four bytes holds the "first" non-zero and
then proceed as if memchr() had located that byte. Which will be
faster? Measure, measure, measure.
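
The "couple of comparisons" step might look like the sketch below. It assumes a 32-bit word in which "first" means most significant; `leftmost_nonzero_byte` is a hypothetical name.

```c
#include <stdint.h>

/* For a non-zero 32-bit word, returns which byte holds the leftmost set
   bit: 0 is the most significant byte, 3 the least significant.
   Exactly two comparisons are made on every path. */
unsigned leftmost_nonzero_byte(uint32_t word)
{
    if (word > UINT32_C(0x0000FFFF))
        return (word > UINT32_C(0x00FFFFFF)) ? 0u : 1u;
    return (word > UINT32_C(0x000000FF)) ? 2u : 3u;
}
```

The selected byte can then be extracted with `(word >> (24 - 8 * byte)) & 0xFF` and handed to whichever per-byte method measures best.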

If you need a bit index and don't want to use a table, the
obvious test-and-shift-and-count approach is probably best. If
you need a mask as opposed to an index, the `x & (x - 1)' trick
may be faster. Measure, measure, -- is there an echo in here?
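
A sketch of both ideas, with hypothetical function names. One caveat worth stating: `x & (x - 1)` operates on the lowest (rightmost) set bit, so it matches "first" only when the first position is defined as the least significant bit; the test-and-shift loop below assumes the opposite convention (index 0 is bit 31).

```c
#include <stdint.h>

/* Test-and-shift-and-count: index of the leftmost set bit, where index 0
   is bit 31. Returns 32 for a zero word. */
unsigned leftmost_bit_index(uint32_t x)
{
    unsigned index = 0;
    if (x == 0)
        return 32;
    while (!(x & UINT32_C(0x80000000))) {
        x <<= 1;
        index++;
    }
    return index;
}

/* x & (x - 1) clears the lowest set bit of x and leaves the rest. */
uint32_t clear_lowest_set_bit(uint32_t x)
{
    return x & (x - 1);
}

/* Mask containing only the lowest set bit of x (0 if x is 0). */
uint32_t lowest_set_bit_mask(uint32_t x)
{
    return x & ~(x - 1);
}
```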

Finally, an impression of the two Web pages you cited. A
quick glance suggests they're filled with well-intentioned and
sometimes reasonable advice, but ALL of it is out of context.
(Necessarily so, of course: the authors of the Web pages have no
idea what kind of program you're trying to write.) Even expert
advice can become nonsense when taken out of context. Your
grandmother's doctor is (presumably) an expert; do you therefore
take the same medicines that are prescribed for your grandmother?
IMHO, it is best to read and understand advice of the sort these
pages offer, but NOT to follow it until and unless you have reason
to believe it applies to your situation.

--
Eric Sosman
es*****@acm-dot-org.invalid





Maybe http://www.jjj.de/bitwizardry/ is a good start.

a+, ld.


On 2006-03-13, Eric Sosman <es*****@acm-dot-org.invalid> wrote:
> First, it's not at all clear what you're trying to speed
> up, nor what you have measured and found to be too slow. A







Hi Eric,

I'm looking for advice/pointers on the operators mentioned above. It's
not a case of anything being too slow: it's a question of ensuring or
approaching optimum performance as early as possible. I did a lot of
this stuff at assembler level years ago when writing high-performance
VGA libraries, but want, if possible, to leave it all in C now.

From my opening paragraph:

shift & rotate operations: left & right shift operations
anding: bitwise AND operations
oring: bitwise OR operations

I am writing some small libraries for generating image masks,
transparency masks and edge silhouettes. Some more of the details are below.

> "generic" optimization is often ill-directed; optimizing "on
> general principles" is often ill-advised. If you have a specific
> operation you'd like to speed up, it'd be a good idea to show
> the actual code you're currently using, along with measurements
> of its current speed and an estimate of the amount of speedup
> you simply MUST achieve to make the code useful.

I'm looking for general guidelines: nothing spectacularly
specific. You do give some good guidelines, some of which had
crossed my mind already, but certainly not all.

> As to finding the first non-zero bit in an array of 64 ints,
> the first thing to do is nail down what you mean by "first," lest
> endianness issues pollute the solutions. Having done that, you

That's in the details: but here we can assume that the sequence is
high bit == leftmost pixel. Also that any colour issues are segregated
into 3 or more colour planes: initial code can assume a monochrome
bitmap.

I don't really think endianness comes into it, since the operations will
be taken care of at the word level. Even if it wasn't monochrome, then
the most significant X bits would be the leftmost pixel, where X is the colour
depth: in this case I would see the only code issues being that of
increasing the bit width of the test mask and the number of shifts
done in order to check the next pixel.
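
The widened test mask and larger shift can be sketched as follows. This is an illustration under stated assumptions rather than code from the thread: pixels are packed into a 32-bit word with the leftmost pixel in the most significant bits, the depth divides 32 evenly, and `leftmost_pixel` is a hypothetical name.

```c
#include <stdint.h>

/* Returns the position (0 = leftmost) of the first non-zero pixel in a
   32-bit word holding 32/depth packed pixels, or 32/depth if every pixel
   is zero. Assumes depth is 1, 2, 4, 8 or 16. */
unsigned leftmost_pixel(uint32_t word, unsigned depth)
{
    const unsigned pixels = 32u / depth;
    /* Test mask covering the top `depth` bits of the word. */
    uint32_t mask = (uint32_t)~(UINT32_C(0xFFFFFFFF) >> depth);
    unsigned pos = 0;
    while (pos < pixels && !(word & mask)) {
        mask >>= depth;   /* step to the next pixel to the right */
        pos++;
    }
    return pos;
}
```

With `depth` equal to 1 this reduces to the monochrome case of shifting a single test bit.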

(Just to clarify the endian thing: all work is in memory buffers and
is in no way related to video HW addressing.)

> should also decide what form the answer should take -- it may be
> simpler (or harder) to get a bit mask than a bit index, and you
> may benefit by adjusting the rest of the code to use the more
> easily obtained answer.

That's something to think about, OK. Although I always considered just
anding and shifting 0x80000000 until a non-zero result was
indicated. The mask can be generated from the bit by subtracting one
(e.g. 0x40000000-1 -> 0x3fffffff) and re-adding the bit.

After the left pixel is detected we detect the rightmost, and if we are
creating a mask of words for the rest of the scanline it's then trivial.

> If one-bits are expected to be fairly sparse, you'll probably
> want to search through largish "chunks" to find one that isn't
> all zeroes, then search within that chunk for the proper bit.
> You could do the search an `int' at a time with an ordinary loop,
> or you might use memchr() to search byte-by-byte. The memchr()
> method might (or might not) take longer for the search, but note
> that it leaves a smaller space for the bit-by-bit search to explore.
> You'd need to implement both and measure.

Yes, check the word for 0 before looking for a set bit: a simple enough &
very practical operation with no real overhead at the time of the word
read. I don't expect the images to be sparse, so I'm not sure it's
worth the overhead of calling a library to find it. Considering leftmost
pixel detection in C, this seems fairly optimal to me (and
this is the type of thing I'm looking for advice on):

(early caveat: all pseudo-C code assumes a 32-bit word for clarity, to avoid
getting lost in type sizes)

    bitmap = 0;
    while (columnsToCheck-- && !(bitmap = *imagePtr++))
        ;
    if (bitmap) {
        /* now find leftmost pixel */
    }

Should be optimal enough?

> If you've used memchr() and have located a non-zero byte, it

memchr() would certainly not be on the agenda, I think: it needs a
character to compare, whereas the implicit recognition of "zero"
or "non-zero" is faster. Also, I read in words (properly aligned, of
course) and not bytes: bytes would certainly strangle the application.

> might make sense to use a precalculated 256-element table to finish
> the job without further "computation." Then again, it might not:

Could you expand on that table bit a little? I don't know about this. Is
it something CPU-specific?

> modern CPUs are much faster than memory, and might be able to
> complete a multi-step computation in less time than a memory fetch.
> More measurements are needed.
>
> If you've searched int-by-int you could dive directly into
> the second search, or you could first use a couple of comparisons
> to decide which of the four bytes holds the "first" non-zero and
> then proceed as if memchr() had located that byte. Which will be
> faster? Measure, measure, measure.
>
> If you need a bit index and don't want to use a table, the
> obvious test-and-shift-and-count approach is probably best. If
> you need a mask as opposed to an index, the `x & (x - 1)' trick

Could you expand on that? I would see it as being as simple as

(find left pixel mask)

    /* we know the bitmap is non-zero, so a bit check is a must */
    for (bit = 0x80000000; !(bit & bitmap); bit >>= 1)
        ;
    /* word bit mask protects all bits from leftmost pixel on */
    leftWordMask = bit | (bit - 1);
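
A self-contained version of that fragment, as a sketch: the starting constant is written as 0x80000000 (bit 31) to match the 32-bit word assumed in the thread, and a guard for a zero word is added so the loop always terminates. The name `left_word_mask` is hypothetical.

```c
#include <stdint.h>

/* Returns a mask covering the leftmost set bit of bitmap and every bit to
   its right, for example 0x00400000 -> 0x007FFFFF. Returns 0 for a zero
   word (an added guard; the original fragment assumes a non-zero input). */
uint32_t left_word_mask(uint32_t bitmap)
{
    uint32_t bit = UINT32_C(0x80000000);
    if (bitmap == 0)
        return 0;
    while (!(bit & bitmap))
        bit >>= 1;
    return bit | (bit - 1);
}
```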

> may be faster. Measure, measure, -- is there an echo in here?

I've used a few profilers on VAX, Convex and on MS, but will be using
gcov for initial run-time analysis on Linux, though I'm not sure if it's
what I want for real execution-time profiling. But I'm certainly
looking for "C hints" on faster ways with bitmaps first: then I can
decide on a design approach. I think the code above (silly errors
notwithstanding) appears to be relatively quick, and the core "loop" areas are
small enough that I would expect their execution footprint to get cached.
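
Since gcov reports how often each line executes rather than how long it takes, a rough timing harness may be a useful complement. The sketch below uses the standard clock() function; `scan_words` is a hypothetical stand-in for whichever routine is being compared, and the repeat count is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical routine under test: index of the first non-zero word,
   or count if every word is zero. */
size_t scan_words(const uint32_t *words, size_t count)
{
    size_t i = 0;
    while (i < count && words[i] == 0)
        i++;
    return i;
}

/* Calls scan_words `repeats` times over the buffer and returns the
   elapsed CPU time in seconds as reported by clock(). */
double time_scan(const uint32_t *words, size_t count, unsigned long repeats)
{
    volatile size_t sink = 0;   /* discourages removal of the calls */
    clock_t start = clock();
    for (unsigned long r = 0; r < repeats; r++)
        sink += scan_words(words, count);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The resolution of clock() is coarse, so a large repeat count is usually needed before two candidate implementations can be told apart.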

> Finally, an impression of the two Web pages you cited. A
> quick glance suggests they're filled with well-intentioned and
> sometimes reasonable advice, but ALL of it is out of context.

Yes & no: it has some general advice on standard stuff which can help:
alignment, powers of two, etc. It even discusses the use of machine words
at some stage instead of smaller units, which can cause excessive
overhead, I think. Certainly helpful to anyone embarking on optimizing
anything in C for the first time.

> (Necessarily so, of course: the authors of the Web pages have no
> idea what kind of program you're trying to write.) Even expert
> advice can become nonsense when taken out of context. Your
> grandmother's doctor is (presumably) an expert; do you therefore
> take the same medicines that are prescribed for your grandmother?
> IMHO, it is best to read and understand advice of the sort these
> pages offer, but NOT to follow it until and unless you have reason
> to believe it applies to your situation.







I appreciate your reply,

thanks for taking the time to do so.

