将未更改的计算移出循环是否会提高程序的性能? [英] Does moving out a unchanged computation out of a loop improve a program's performance?

查看:41
本文介绍了将未更改的计算移出循环是否会提高程序的性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 Computer Systems: a Programmer's Perspective 中,有一个优化程序性能的例子.将 'A'-'a' 移出循环会进一步提高程序的性能吗?为什么更快的版本中不包含更改?谢谢.

In Computer Systems: a Programmer's Perspective, there is an example for optimizing program performance. Will moving 'A'-'a' out of the loop further improve the program's performance? Why isn't the change included in the faster version? Thanks.

来自图 5.7 小写转换例程"的代码.这两个程序有着根本的不同性能":

The code from "Figure 5.7 Lowercase conversion routines. The two procedures have radically different performance":

/* Convert string to lowercase: slow */
void lower1(char *s)
{
  long i;

  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

/* Convert string to lowercase: faster */
void lower2(char *s)
{
  long i;
  long len = strlen(s);

  for (i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

/* Sample implementation of library function strlen */
/* Compute length of string */
size_t strlen(const char *s)
{
  long length = 0;
  while (*s != '\0') {
    s++;
    length++;
  }
  return length;
}

推荐答案

TL;DR:我在 2 个系统上进行了测试,结果在两个系统上都没有更快.但是您可以通过删除 strlen() 使代码更快.

TL;DR: I tested it on 2 systems and no, it was not faster on both systems. But you can make the code faster with removing the strlen().

我测试了这两个功能以及更多功能.我测试了当我用 +0x20 替换 -('A'-'a') 时会发生什么,用 |<替换 +/code> 并将 for 循环替换为 while 循环.结果是这不会改变性能.

I tested both functions and a lot more. I tested what happens when i replace -('A'-'a') with +0x20, replaced + by | and replacing the for loop with a while loop. Result is that this does not change performance.

但是,将 strlen() 移到开头并且只调用一次可以提高性能.进一步提高性能的一种方法是删除对 strlen() 的调用,并在同一循环中检查 '\0' 字节.这样我们只需要遍历一次循环,这可能会减少较长字符串的缓存未命中.

However, moving strlen() to the beginning and only call it once improved performance. A way to improve performance even more was to remove the call to strlen() and check for the '\0'-byte inside the same loop. This way we have to go trough the loop only once, this probably reduces cache misses on longer strings.

我通过创建一个随机字符串数组来测试它,复制数组并用一种方法降低所有副本,同时测量降低时间.然后我用相同的随机字符串数组和所有其他方法做了同样的事情.我重复了多次.

I tested it by creating an array of random strings, copy the array and lowering all copies with one method while measuring the lowering time. Then i did the same, with the same random array of strings, with all the other methods. And i repeated this multiple times.

代码应该适用于 POSIX 兼容系统,但您可能需要为其他系统(如 Windows)替换 GetTime() 函数.我用 GCC 和 -O3 标志编译它.

The code should work on POSIX compatible systems, but you probably have to replace the GetTime() function for other systems such as Windows. I compiled it with GCC and the -O3 flag.

#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <errno.h>
#include <stdlib.h>
#include <inttypes.h>

#include <time.h>


//#define DEBUG


#ifdef DEBUG
  #define N 10
#else
  #define N 1000UL*100
#endif

#define M 20

#define STR_(x) #x
#define STR(x) STR_(x)



void lower1(char *s)
{
  size_t i;

  for (i = 0; i < strlen(s); i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] -= ('A' - 'a');
    }
  }
}

void lower2(char *s)
{
  size_t i;
  size_t len = strlen(s);

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] -= ('A' - 'a');
    }
  }
}

void lower3(char *s)
{
  size_t i;
  size_t len = strlen(s);
  int d='A'-'a';

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] -= d;
    }
  }
}

void lower4(char *s)
{
  size_t i;
  size_t len = strlen(s);

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] += 0x20;
    }
  }
}

void lower5(char *s)
{
  size_t i;

  for (i = 0; i < strlen(s); i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] += ('a' - 'A');
    }
  }
}

void lower6(char *s)
{
  size_t i;
  for (i = 0; i < strlen(s); i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] |= 0x20;
    }
  }
}

void lower7(char *s)
{
  size_t i;
  size_t len = strlen(s);

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] |= 0x20;
    }
  }
}

void lower8(char *s)
{
  size_t len = strlen(s);
  while(len--)
    {
      if (*s >= 'A' && *s <= 'Z')
      {
        *s |= 0x20;
      }
      s++;
    }
}

void lower9(char *s)
{
  while(1)
  {
    if (!*s)
    {
      break;
    }
    if (*s >= 'A' && *s <= 'Z')
    {
      *s |= 0x20;
    }
    s++;
  }
}

void lowerA(char *s)
{
  while(*s)
  {
    if (*s >= 'A' && *s <= 'Z')
    {
      *s |= 0x20;
    }
    s++;
  }
}

uint64_t die(const char *msg)
  {
    fprintf(stderr,"die: %s : %s\n",msg,strerror(errno));
    exit(1);
  }

uint64_t getTime(void)
  {
    uint64_t time;
    struct timespec  t_v;
    if(clock_gettime(CLOCK_BOOTTIME,&t_v)<0)
      {
        die("cant get time");
      }
    time=t_v.tv_sec*1000000000ULL;
    time+=t_v.tv_nsec;
    return time;
  }
  

void test(void (*fp)(char *),char (*s)[M],const char *name)
  {
    static char (*copy)[M];
    copy=malloc(N*M);
    if(!copy)
      {
        die("can't alloc memory");
      }
    memcpy(copy,s,N*M);
    uint64_t start=getTime();
    for(size_t u=0;u<N;u++)
      {
        fp(copy[u]);
      }
    uint64_t end=getTime();
    printf("time %13"PRIu64" %s\n",end-start,name);
    #ifdef DEBUG
      for(size_t u=0;u<N;u++)
        {
          printf("%3zu %"STR(M)"s %"STR(M)"s\n",u,s[u],copy[u]);
        }
    #endif
    free(copy);
  }
  
void runTest(void)
{
  //create a random string
  srand(getTime());
  static char string[N][M];
  for(size_t u=0;u<N;u++)
  {
    size_t l=rand()%M;
    for(size_t i=0;i<l;i++)
    {
      string[u][i]=rand()%('z'-'/')+'/';
    }
    string[u][l]='\0';
  }
  #define TEST(s) test(s,string,STR(s))
  TEST(lower1);
  TEST(lower2);
  TEST(lower3);
  TEST(lower4);
  TEST(lower5);
  TEST(lower6);
  TEST(lower7);
  TEST(lower8);
  TEST(lower9);
  TEST(lowerA);
}

int main(void)
{
  for(unsigned i=0;i<8;i++)
  {
    runTest();
  }
  return 1;
}

AMD64 上的反汇编显示函数 lower1()lower5()lower6()(调用 strlen 的函数() 在每个循环中,编译器没有优化该调用) 几乎相同,除了地址和 add 指令被替换为 or 指令.lower2()lower3()lower4()lower7()(函数 where strlen() 只被调用一次和 for 被使用)也几乎相同.lower8() 彼此不同(使用一次 strlen() 和一个 while 循环).loop9()loopA() 几乎相同,不调用 strlen())

The disassembly on AMD64 shows that functions lower1(), lower5() and lower6() (functions that call strlen() in every loop, compiler did not optimize that call) are almost identical with the exception of addresses and that a add instructions was replaced by the or instruction. lower2(), lower3(), lower4() and lower7() (functions where strlen() is only called once and for is used) are also almost identical. lower8() is different from each other (uses strlen() once and a while-loop). loop9() and loopA() are almost identical and do not call strlen())

在我在 Raspberry Pi 上运行的 Debian 9 Stretch ARM 上,函数 lower9()lowerA() 与所有其他经过测试的函数一样快.lower2()lower3()lower4()lower7()lower8() 花费了大约 58-66% 的时间,但彼此相同.为 lower8() 分配不同的程序集,执行时间没有显着差异.lower1()lower6()lower9()lowerA() 花费了大约 297-348% 的时间,有趣的是 lower5() 花费的时间更长(在多次测量中一致),达到 324%-375%.我不知道为什么 lower5() 用了更长的时间,因为它使用相同的机器代码,除了地址不同(这对于 ARM 代码也是如此).

On my Debian 9 Stretch ARM running on a Raspberry Pi, the functions lower9() and lowerA() are equally as fast and faster than all other tested functions. lower2(), lower3(), lower4(), lower7() and lower8() took about 58-66% more time but are equally to each other. Dispate the different assembly for lower8() the execution time did not differ significantly. lower1() and lower6() took about 297-348% longer than lower9() and lowerA(), interestingly lower5() took even longer (consistent in multiple measurements) with 324%-375%. I do not know why lower5() took longer since it uses the same machine code except for different addresses (this is also true for the ARM code).

在我的 Debian 10 Buster AMD64 上,函数 lowerA() 是最快的,比 lower9() 快约 3%-6%.我不知道为什么.但是 lower5() 在这里与 lower1()lower6() 一样快.

On my Debian 10 Buster AMD64, the function lowerA() is the fastest, faster than lower9() by about 3%-6%. I don't know why. But lower5() is here as fast as lower1() and lower6().

这篇关于将未更改的计算移出循环是否会提高程序的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆