Low-level, ugly pointer arithmetic


Question




Hi!

I'm trying to squeeze a few clock cycles out of a tight
loop that profiling shows to be a bottleneck in my program.
I'm at a point where the only thing that matters is
execution speed, not style (within the loop, obviously).

The loop deals with an array:

// this is aligned at cache line boundaries
struct hash_t {
    int id;
    int next;
    double d0;
    double d1;
    double d2;
} hashtable[HASH_MAX];

Within the loop, whenever there is a hashtable miss I
need to store new values into hashtable[index].
Originally this looked like

if(...) {
    hashtable[index].id=some_int;
    hashtable[index].next=some_int2;
    hashtable[index].d0=some_double0;
    hashtable[index].d1=some_double1;
    hashtable[index].d2=some_double2;
}

Now... I'm trying to save a few cycles by doing something
along the lines of

if(...) {
    // point to the first member
    int *hashtable_ptr = reinterpret_cast<int*>(&hashtable[index].id);
    *(hashtable_ptr++) = some_int; // store into id and move on
    *(hashtable_ptr++) = some_int2; // store into next and move on
    *(hashtable_ptr++) = some_double0; // << --- trouble
    // ...
}

The trouble is that I'm working with a pointer to int
and I want to store a double and, later on, advance the
pointer not by sizeof(int) bytes, but by sizeof(double)
bytes. On my system sizeof(int) is 4, sizeof(double) is 8
and portability is not an issue.

I tried

*((reinterpret_cast<double*>(hashtable_ptr))++) = some_double0;

but I got an error:

"the result of this cast cannot be used as an lvalue".

Why's that? I really need to force the compiler to treat
this pointer as a double* for a moment... I realize I can
have two pointers to the same hashtable entry, one an int*
and one a double*, but I need to save every clock cycle I
can as this loop is executed trillions of times.

What's the usual procedure to iterate a pointer through
members of varying types? Perhaps it would be easier with
a char*, but that means advancing by 4 and 8 which,
I suppose, would be slower than plain ++? Or does plain ++
boil down to moving by offsets of 4 or 8 at the assembly level too?

thanks in advance,
- J.

PS. Are pointers to different datatypes guaranteed to
be of the same size? If not, then perhaps I need
an assert here and there...
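
On the PS: the C++ standard does not promise that pointers to different object types have the same size, so a compile-time check is cheap insurance. The following is a minimal sketch, assuming a compiler with C++11 static_assert; the numeric expectations (4-byte int, 8-byte double, no padding inside hash_t) are taken from the figures stated in the post and are assumptions about that particular platform, not general guarantees.

```cpp
#include <cstddef>  // offsetof

struct hash_t {
    int id;
    int next;
    double d0;
    double d1;
    double d2;
};

// Sizes stated in the post for the target system.
static_assert(sizeof(int) == 4, "int is expected to be 4 bytes");
static_assert(sizeof(double) == 8, "double is expected to be 8 bytes");

// Not guaranteed by the standard, so check it rather than assume it.
static_assert(sizeof(int*) == sizeof(double*),
              "int* and double* are expected to have the same size");

// Layout the pointer-walking code silently relies on.
static_assert(offsetof(hash_t, next) == 4, "next is expected right after id");
static_assert(offsetof(hash_t, d0) == 8, "no padding expected before d0");
static_assert(sizeof(hash_t) == 32, "no trailing padding expected");
```

If any of these assumptions is false on a given compiler, the build fails instead of the loop corrupting the table at run time.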

Solution

Jacek Dziedzic wrote:

> Hi!
>
> I'm trying to squeeze a few clock cycles out of a tight
> loop that profiling shows to be a bottleneck in my program.
> I'm at a point where the only thing that matters is
> execution speed, not style (within the loop, obviously).
>
> The loop deals with an array:
>
> // this is aligned at cache line boundaries
> struct hash_t {
>     int id;
>     int next;
>     double d0;
>     double d1;
>     double d2;
> } hashtable[HASH_MAX];
>
> Within the loop, whenever there is a hashtable miss I
> need to store new values into hashtable[index].
> Originally this looked like
>
> if(...) {
>     hashtable[index].id=some_int;
>     hashtable[index].next=some_int2;
>     hashtable[index].d0=some_double0;
>     hashtable[index].d1=some_double1;
>     hashtable[index].d2=some_double2;
> }
>
> Now... I'm trying to save a few cycles by doing something
> along the lines of
>
> if(...) {
>     // point to the first member
>     int *hashtable_ptr = reinterpret_cast<int*>(&hashtable[index].id);
>     *(hashtable_ptr++) = some_int; // store into id and move on
>     *(hashtable_ptr++) = some_int2; // store into next and move on
>     *(hashtable_ptr++) = some_double0; // << --- trouble
>     // ...
> }

Accessing a double through a pointer to int causes undefined behavior.
In any case, unless you have a 10+ year old compiler, it is almost
certain that the compiler has already done this optimization (or even a
better one). If you're at this level of optimization then you should
probably look at the disassembly of the code to see what the compiler
generates and/or write your own assembly routine. Many processors have
vector instructions or similar features that could help you here.

Regards,
Bart.
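
For comparison, here is a minimal sketch of a cast-free version of the store that stays within defined behavior: it binds a reference to the entry and assigns the members through it. The function name store_entry and the value chosen for HASH_MAX are hypothetical (the thread gives neither), and whether this is faster than the indexed form on any given compiler is something only a profiler can answer.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t HASH_MAX = 1024;  // hypothetical size; not given in the thread

struct hash_t {
    int id;
    int next;
    double d0;
    double d1;
    double d2;
};

hash_t hashtable[HASH_MAX];

// Stores one entry without any pointer casts. The reference names the
// element once; every member store then goes through that one name.
inline void store_entry(std::size_t index, int id, int next,
                        double d0, double d1, double d2) {
    assert(index < HASH_MAX);
    hash_t& e = hashtable[index];
    e.id = id;
    e.next = next;
    e.d0 = d0;
    e.d1 = d1;
    e.d2 = d2;
}
```

Each member keeps its own type here, so no int* ever touches a double and the compiler is free to schedule the five stores however it likes.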


<snip>

Oh, and one more thing. Since you seem to be just copying members of a
POD structure, memcpy() is also acceptable, and it is probably
implemented optimally for your system as well.

Regards,
Bart.
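
A minimal sketch of the memcpy() suggestion, assuming the new values are first gathered into a local hash_t and then copied into the table in one call. The name store_entry_memcpy and the value of HASH_MAX are hypothetical; the thread specifies neither.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const std::size_t HASH_MAX = 1024;  // hypothetical size; not given in the thread

struct hash_t {
    int id;
    int next;
    double d0;
    double d1;
    double d2;
};

hash_t hashtable[HASH_MAX];

// Copies a whole POD entry into the table as raw bytes.
inline void store_entry_memcpy(std::size_t index, const hash_t& src) {
    assert(index < HASH_MAX);
    std::memcpy(&hashtable[index], &src, sizeof(hash_t));
}
```

For a POD type, plain assignment (hashtable[index] = src;) has the same effect, so which form produces better code is again a question for the profiler.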


Bart wrote:

> [snip]
> Accessing a double through a pointer to int causes undefined behavior.

But doesn't a cast alleviate this? I thought the compiler
would understand that I want to treat the binary representation
of a pointer to int as if it were a pointer to double, so that
it would store the value correctly and then increment the pointer
correctly. Won't this work?

> In any case, unless you have a 10+ year old compiler, it is almost
> certain that the compiler has already done this optimization (or even a
> better one). If you're at this level of optimization then you should
> probably look at the disassembly of the code to see what the compiler
> generates and/or write your own assembly routine.

Yes and no. The compiler is a month-old Intel compiler,
tuned to this particular architecture (Itanium 2), so you'd
expect it to be extremely aggressive, more so since this
architecture relies heavily on the compiler doing the
optimizations.

However, even with -O3 and other aggressive optimization
options I was able to outsmart the compiler in two places
by merely converting

array[index][0]=double0;
array[index][1]=double1;
array[index][2]=double2;

into

array_ptr = array[index];
*(array_ptr++)=...;
*(array_ptr++)=...;
*(array_ptr )=...;

as proven by profiler output. Writing my own assembly
routine is out of the question: this IA64 assembly output
is absolutely unreadable (at least to me, I only have
experience with x86 assembly) and looks like it is
quite heavily optimized already. Still, the experience of the
past three days shows that translating indexing into
pointer arithmetic somehow makes the compiler perform
better still.

> Many processors have
> vector instructions or similar features that could help you here.

Yes, this processor in particular. My teammate managed
to vectorize the pow() and sqrt() operations which were
the previous bottleneck. What appears to be the next bottleneck
is cache penalties for accessing elements of a large
(~1e6 doubles) array in an almost random order, many many
times. Hence I am trying to code a hashtable that would
store the values of the most recently used elements in a
smaller table that would fit within the L2 cache.

Anyway, is there _really_ no way to access elements of
a struct using a single, advancing pointer? I suspect
some controller-programming C gurus would know a way?

TIA,
- J.
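
One way to get a single advancing pointer without aliasing a double through an int* is to walk the entry as bytes and copy each value in with memcpy(). The sketch below is an illustration of that idea, not something proposed in the thread: the helper put and the function store_entry_bytes are hypothetical names, and the code assumes the members of hash_t are laid out back to back with no padding, which the static_asserts check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct hash_t {
    int id;
    int next;
    double d0;
    double d1;
    double d2;
};

// Assumption from the thread: members are contiguous, no padding anywhere.
static_assert(offsetof(hash_t, next) == sizeof(int), "padding after id");
static_assert(offsetof(hash_t, d0) == 2 * sizeof(int), "padding before d0");
static_assert(sizeof(hash_t) == 2 * sizeof(int) + 3 * sizeof(double),
              "unexpected padding in hash_t");

// Copies the bytes of value to *p, then advances p past them.
template <typename T>
inline void put(unsigned char*& p, const T& value) {
    std::memcpy(p, &value, sizeof(T));
    p += sizeof(T);
}

// Fills one entry through a single byte pointer that advances by the
// size of whatever was just stored (4 or 8 bytes on the poster's system).
inline void store_entry_bytes(hash_t& e, int id, int next,
                              double d0, double d1, double d2) {
    unsigned char* p = reinterpret_cast<unsigned char*>(&e);
    put(p, id);
    put(p, next);
    put(p, d0);
    put(p, d1);
    put(p, d2);
    // The pointer should now sit exactly one entry past the start.
    assert(p == reinterpret_cast<unsigned char*>(&e) + sizeof(hash_t));
}
```

Compilers commonly turn a fixed-size memcpy() like this into an ordinary store, so the generated code may well be identical to the plain member assignments; checking the disassembly or the profiler is the only way to know for a given build.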

