How to measure elapsed time below a nanosecond on x86?


Problem description

I have searched for and tried many approaches to measuring elapsed time, and there are many questions on this topic. For example, this question is very good, but I could not find a good method for when you need an accurate time recorder. So I want to share my method here, so that it can be used, and corrected if something is wrong.

UPDATE & NOTE: This question is about benchmarking intervals of less than one nanosecond. That is completely different from using clock_gettime(CLOCK_MONOTONIC, &start);, which records times longer than one nanosecond.

UPDATE: A common way to measure speedup is to repeat the section of the program being benchmarked. But, as mentioned in the comments, the compiler may optimize the repeated code differently when the researcher relies on auto-vectorization.

NOTE: Measuring the elapsed time over a single repetition is not accurate enough. In some cases my results show that the section must be repeated more than 1K or 1M times to get the smallest time.
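To make the sub-nanosecond scale concrete: one time-stamp-counter tick on a multi-gigahertz part is already a fraction of a nanosecond, and dividing the tick count of many back-to-back repetitions by the repetition count gives a per-repetition figure finer than one tick. The arithmetic might be sketched as below. The helper names `ticks_to_ns` and `ns_per_rep` and the 3 GHz figure are assumptions for illustration, not part of the original post; the real TSC frequency is machine-specific and has to be determined separately.

```c
#include <stdint.h>

/* Hypothetical helper: convert a TSC tick count to nanoseconds.
   tsc_hz is the TSC frequency in Hz (an assumed, machine-specific input). */
double ticks_to_ns(uint64_t ticks, double tsc_hz)
{
    return (double)ticks * 1e9 / tsc_hz;
}

/* Hypothetical helper: average time per repetition when `reps`
   back-to-back copies of a section took `ticks` ticks in total. */
double ns_per_rep(uint64_t ticks, uint64_t reps, double tsc_hz)
{
    return ticks_to_ns(ticks, tsc_hz) / (double)reps;
}
```

With an assumed 3 GHz TSC, `ticks_to_ns(3, 3e9)` is 1 ns, and 3000 ticks spread over 4000 repetitions gives `ns_per_rep(3000, 4000, 3e9)` = 0.25 ns per repetition.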

SUGGESTION: I'm not familiar with shell programming (I only know some basic commands), but it might be possible to measure the smallest time without repeating the section inside the program.

MY CURRENT SOLUTION: To avoid branches, I repeat the code section using a macro, #define REP_CODE(X) X X X... X X, where X is the code section I want to benchmark, as follows:

#include <x86intrin.h>   // _rdtsc()

// numbers
// MAX1, REP_CODE and the variables i, j, ii, t1_rdtsc, t2_rdtsc and
// ttotal_rdtsc[] are declared elsewhere; the post does not show them.
#define FMAX1 (MAX1*MAX1)   // parenthesized so it is safe inside larger expressions
#define COEFF 8
int __attribute__((aligned(32))) input[FMAX1+COEFF];
int __attribute__((aligned(32))) output[FMAX1];
int __attribute__((aligned(32))) coeff[COEFF] = {1,2,3,4,5,6,7,8};

int main()
{
    REP_CODE(
        t1_rdtsc=_rdtsc();
        //Code
        for(i = 0; i < FMAX1; i++){
            for(j = 0; j < COEFF; j++){//IACA_START
                output[i] += coeff[j] * input[i+j];
            }//IACA_END
        }
        t2_rdtsc=_rdtsc();
        ttotal_rdtsc[ii++]=t2_rdtsc-t1_rdtsc;
        )
    // The smallest element in `ttotal_rdtsc` is the answer
}

This does not affect the optimization, but it is limited by code size, and in some cases the compile time becomes excessive.
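The post does not show how REP_CODE is actually defined. One way such a macro could be built is by doubling, as in the sketch below, which repeats its argument 8 times. The helper macros `REP2`, `REP4`, `REP8`, the constant `REP_COUNT` and the function `min_ticks_unrolled` are hypothetical names chosen for illustration, and the macros are variadic only so that a code section containing top-level commas still passes through intact.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Hypothetical doubling helpers: each level pastes its argument twice. */
#define REP2(...)      __VA_ARGS__ __VA_ARGS__
#define REP4(...)      REP2(__VA_ARGS__) REP2(__VA_ARGS__)
#define REP8(...)      REP4(__VA_ARGS__) REP4(__VA_ARGS__)
#define REP_CODE(...)  REP8(__VA_ARGS__)
#define REP_COUNT 8    /* must match the expansion factor above */

static uint64_t samples[REP_COUNT];

/* Times a trivial increment REP_COUNT times, with no loop branch
   between the timed copies, and returns the smallest tick count. */
uint64_t min_ticks_unrolled(volatile int *sink)
{
    int n = 0;
    uint64_t t1, t2;

    REP_CODE(
        t1 = __rdtsc();
        *sink += 1;                 /* stand-in for the section under test */
        t2 = __rdtsc();
        samples[n++] = t2 - t1;
    )

    uint64_t best = samples[0];
    for (int k = 1; k < REP_COUNT; k++)
        if (samples[k] < best)
            best = samples[k];
    return best;
}
```

Each extra doubling level doubles the emitted code, which is where the code-size and compile-time cost mentioned above comes from.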

Any suggestions or corrections?

Thanks.

Recommended answer

If you have a problem with the auto-vectorizer and want to limit it, just add an asm("#something"); after your begin_rdtsc, which separates the do-while loop. I just checked, and with this change the compiler vectorized your posted code, which the auto-vectorizer had previously been unable to vectorize. I changed your macro; you can use it as follows:

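#include <stdio.h>
#include <x86intrin.h>   // _rdtsc()

// do_while (the repeat count) and OVERAL_TIME must be defined before this point.
long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[do_while], ttbest_rdtsc = 99999999999999999, elapsed, elapsed_rdtsc = do_while, overal_time = OVERAL_TIME, ttime = 0;
int ii = 0;

// Opens the timing loop; the asm comment separates the loop from the timed code.
#define begin_rdtsc\
    do{\
        asm("#mmmmmmmmmmm");\
        t1_rdtsc=_rdtsc();

// Closes the timing loop, then prints the smallest sample.
// ++ii (not ii++) keeps the last write inside ttotal_rdtsc[do_while].
#define end_rdtsc\
        t2_rdtsc=_rdtsc();\
        asm("#mmmmmmmmmmm");\
        ttotal_rdtsc[ii]=t2_rdtsc-t1_rdtsc;\
    }while (++ii<do_while);\
    for(ii=0; ii<do_while; ii++){\
        if (ttotal_rdtsc[ii]<ttbest_rdtsc){\
            ttbest_rdtsc = ttotal_rdtsc[ii];}}\
    printf("\nthe best is %lld in %lld iteration\n", ttbest_rdtsc, elapsed_rdtsc);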
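The repeat-and-keep-the-smallest idea behind these macros can also be written as an ordinary function, sketched below. This is a variant rather than the answer's own code: it swaps the asm comments for `_mm_lfence()` calls around `__rdtsc()`, a commonly used way to stop the out-of-order core from moving the timestamp reads across the timed region. The names `best_ticks`, `sample_kernel`, `kernel_calls` and the repeat count `N_RUNS` are hypothetical. Calling the section through a function pointer also adds call overhead and prevents inlining, so this form suits larger sections better than single instructions.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() on GCC/Clang */

#define N_RUNS 1000      /* hypothetical repeat count */

typedef void (*bench_fn)(void);

/* Runs fn N_RUNS times and returns the smallest tick delta observed. */
uint64_t best_ticks(bench_fn fn)
{
    uint64_t best = UINT64_MAX;

    for (int r = 0; r < N_RUNS; r++) {
        _mm_lfence();                /* earlier work finishes before the first read */
        uint64_t t1 = __rdtsc();
        _mm_lfence();                /* timed code does not start before the read */

        fn();                        /* section under test */

        _mm_lfence();                /* timed code finishes before the second read */
        uint64_t t2 = __rdtsc();
        _mm_lfence();

        uint64_t d = t2 - t1;
        if (d < best)
            best = d;
    }
    return best;
}

/* Hypothetical stand-in workload that only counts its invocations. */
unsigned long kernel_calls;
void sample_kernel(void) { kernel_calls++; }
```

A call such as `best_ticks(sample_kernel)` returns the minimum over `N_RUNS` measurements, including the fixed cost of the fences and the two timestamp reads.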
