ç性能和编译选项 [英] C performance and compiling options

查看:150
本文介绍了ç性能和编译选项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个类似的实现(Java和C ++)为小事算法,如选择排序​​。

 公共接口SortingAlgorithm {    公共无效排序(INT []一);
}公共类选择排序实现SortingAlgorithm {    @覆盖
    公共无效排序(INT [] A){
        的for(int i = 0; I<则为a.length;我++){
            INT lowerElementIndex = I;
            对于(INT J = I + 1; J<则为a.length; J ++){
                如果(一个研究[J] LT; A [lowerElementIndex]){
                    lowerElementIndex = j的;
                }
            }
            掉期(一,lowerElementIndex,I);
        }
    }    私人无效掉期(INT []一,INT I,诠释J){
        如果(我== j)条{
            返回;
        }
        INT TEMP = a [i];
        一个由[i] = A [J]。
        一个研究[J] =温度;
    }
}

和C-之一:

 内嵌无效掉期(INT *一,INT I,诠释J);无效s_sort为(int *一,诠释大小){
  INT I;
  对于(i = 0; I<大小;我++){
    INT lowerElementIndex = I,J;
    为(J = I + 1; J<大小; J ++){
      如果(一个研究[J] LT; A [lowerElementIndex]){
    lowerElementIndex = j的;
      }
    }
    掉期(一,lowerElementIndex,I);
  }
}内嵌无效掉期(INT *一,INT I,诠释J){
  如果(我== j)条{
    返回;
  }
  INT TEMP = a [i];
  一个由[i] = A [J]。
  一个研究[J] =温度;
}

现在,我尝试了大阵(100000随机INT)上进行测试。
在第一次的结果
java的:〜17秒(编译并与Oracle JDK / JVM执行)
C:〜22秒(用gcc编译V4.8不使用任何优化)

当然,我则试图通过CFLAGS优化我的C版本。
结果(我只报告CFLAGS):
-O1:〜18.4

-O2:〜18.4

{-O 3-9}:〜20.9

现在,我的第一个问题是我应该使用哪个cflacs编译?

所以我读到优化了GNU手册。
添加-march =本地没有帮助。一段时间后,你花其他的选择,我来到-fprofile弧选项。它添加到我的国旗让我的code完成其测试在大约11秒!
然而,一些文件出现在我的文件夹:在分析的结果。据我了解,我应该-fbranch-概率使用它们,重新编译code。
在〜18.5秒再重新编译的结果。 这是我真正想问的。

怎么可能为我的程序运行得这么快,如果它有写入文件,收集分析信息,而不是将其运行速度较慢的1.5倍时,它也没有?

我忘了提,我是一个老的PC安装了英特尔赛扬2.8GHz的@处理器和Linux(Fedora的20与XFCE)上。如果您需要了解硬件的其他信息,只问! ;)

编辑:
在code我使用的测试方法是:

Java的:

 公共类的测试{    公共静态无效的主要(字串[] args){
        INT [] =一个新的INT [100000]。
        INT [] A2 =新INT [100000]。
        的for(int i = 0; I<则为a.length;我++){
            一个由[i] =(INT)(的Math.random()* 100000);
            A2 [I] = a [i];
        }
        选择排序S =新的选择排序();
        插入排序S1 =新插入排序();
        双启动= System.nanoTime();
        s.sort(一);
        双端= System.nanoTime();
        双倍时间=(最终开始)/1000000000.0;
        的System.out.println(选择:+时间);
        开始= System.nanoTime();
        s1.sort(A2);
        结束= System.nanoTime();
        时间=(最终开始)/1000000000.0;
        的System.out.println(插入+时间);
    }
}

以及C:

 的#includeinsertion_sort.h
#包括selection_sort.h
#包括LT&;&time.h中GT;
#包括LT&;&stdlib.h中GT;
#包括LT&;&stdio.h中GT;
#包括LT&;&string.h中GT;诠释主(){
  INT最大= 100000,我;
  函数srand(时间(NULL));  int数组[100000]数组2 [100000]。
  对于(I = 0; I&下; 100000; I + = 1){
    数组[我] =兰特()%100000;
  }  的memcpy(数组2,与放,阵列[0],100000 * sizeof的(INT));  clock_t表示inizio =时钟();
  s_sort(数组,最大值);
  clock_t表示精=时钟();
  浮动tempoEsecuzione =(浮点)(精 - inizio)/ CLOCKS_PER_SEC;
  的printf(选择:%2.3f \\ n,tempoEsecuzione);  inizio =时钟();
  i_sort(数组2,最大值);
  精=时钟();
  tempoEsecuzione =(浮点)(精 - inizio)/ CLOCKS_PER_SEC;
  的printf(插入:%2.3f \\ n,tempoEsecuzione);
  返回0;
}

在code包含对我没有列入的问题,因为(如预期)的Java运行较慢的C的其余部分插入排序函数的引用。


解决方案

  

这是我真正想问的。


  
  

这怎么可能为我的程序运行得这么快,如果它有写
  文件和收集分析信息,而是它运行的1.5倍
  慢时,它也没有?


是的,这是这里真正的问题。所有提及了Java比较的东西只是增加了噪音。

我可以重现怪异的行为在我的机器上用gcc 4.7.2。毫不奇怪,code的热路径是内循环:

 为(J = I + 1; J<大小; J ++){
  如果(一个研究[J] LT; A [lowerElementIndex]){
    lowerElementIndex = j的;
}

在相应生成的汇编code,唯一相关的区别是:

快速案例:

  CMPL%ESI,ECX%
    JGE .L3
    MOVL%ECX,ESI%
    movslq%EDX,%RDI
.L3:

慢速情况:

  CMPL%ECX,ESI%
cmovl%EDX,EDI%
cmovl%ESI,ECX%

第一种情况(快速)可以从分支prediction 但其他(慢的情况下)显然不能。排序或随机洗牌数组不会造成太大分支误predictions 的。第一code段是在这种情况下最优的。

事实证明,它实际上是很难创造了导致选择排序很多分支误predictions的数据集。 (有人指出,由<一个href=\"http://stackoverflow.com/questions/21055946/why-does-tree-vectorization-make-this-sorting-algorithm-2x-slower#comment31663387_21055946\">Yakk;又见<一个href=\"http://stackoverflow.com/questions/21055946/why-does-tree-vectorization-make-this-sorting-algorithm-2x-slower#comment31676824_21055946\">my 的企图创造一个邪恶的数据,到目前为止,我没有创建一个)

-fprofile弧恰好禁用树矢量这似乎是负责产生缓慢的情况下,code。一个更好的方法来禁用树矢量是通过 -fno-树矢量标记。

铛3.4也产生快速的情况下,code,没有任何特殊的标志。在Java code的没有的热身跑2.4倍比C code慢。 (由于这是不是问题,我还没有研究改进Java的code性能。)

I've got two similar implementations (java and c++) for a trivial algorithm like the selection sort.

public interface SortingAlgorithm {

    public void sort(int[] a);
}

public class SelectionSort implements SortingAlgorithm {

    @Override
    public void sort(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int lowerElementIndex = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[lowerElementIndex]) {
                    lowerElementIndex = j;
                }
            }
            swap(a, lowerElementIndex, i);
        }
    }

    private void swap(int[] a, int i, int j) {
        if (i == j) {
            return;
        }
        int temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
}

and the c one:

inline void swap(int* a, int i, int j);

void s_sort(int* a, int size) {
  int i;
  for (i = 0; i < size; i++) {
    int lowerElementIndex = i, j;
    for (j = i + 1; j < size; j++) {
      if (a[j] < a[lowerElementIndex]) {
    lowerElementIndex = j;
      }
    }
    swap(a, lowerElementIndex, i);
  }
}

inline void swap(int* a, int i, int j) {
  if (i == j) {
    return;
  }
  int temp = a[i];
  a[i] = a[j];
  a[j] = temp;
}

Now, I tried testing them on a large array (100000 random int). The results at first were java: ~17 sec (compiled and executed with oracle jdk/jvm) c: ~22 sec (compiled with gcc v4.8 without any optimization)

Of course, i then tried to optimize my c version through cflags. The results are(I'm reporting cflags only): -O1: ~18.4

-O2: ~18.4

-O{3-9}: ~20.9

Now, my first question is which cflacs should I use to compile?

So I read the gnu manual about optimizations. Adding -march=native didn't helped. After some time spent trying other options, I came into -fprofile-arcs option. Adding it to my flags made my code finish its test in about 11 seconds! However some files appeared in my folders: the results of the profiling. As I understand, I should use them with -fbranch-probabilities and recompile the code. Recompiling results again in ~18.5 sec. And this is what I really want to ask.

How is it possible for my program to run so fast if it has to write files and collect profiling information and instead it runs 1.5 times slower when it hasn't?

I forgot to mention that I'm on an old PC with a Intel Celeron @2.8GHz processor and linux (fedora 20 with xfce) installed. If you need other information about the hardware just ask! ;)

Edit: The code I use for the test is:

Java:

public class Test {

    public static void main(String[] args) {
        int[] a = new int[100000];
        int[] a2 = new int[100000];
        for (int i = 0; i < a.length; i++) {
            a[i] = (int)(Math.random()*100000);
            a2[i] = a[i];
        }
        SelectionSort s = new SelectionSort();
        InsertionSort s1 = new InsertionSort();
        double start = System.nanoTime();
        s.sort(a);
        double end = System.nanoTime();
        double time = (end-start)/1000000000.0; 
        System.out.println("Selection: "+time);
        start = System.nanoTime();
        s1.sort(a2);
        end = System.nanoTime();
        time = (end-start)/1000000000.0;
        System.out.println("Insertion: "+time);
    }
}

And the c:

#include "insertion_sort.h"
#include "selection_sort.h"
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main() {
  int max = 100000, i;
  srand(time(NULL));

  int array[100000], array2[100000];
  for(i=0; i<100000; i+=1) {
    array[i] = rand()%100000;
  }

  memcpy(array2, &array[0], 100000 * sizeof(int));

  clock_t inizio = clock();
  s_sort(array, max);
  clock_t fine = clock();
  float tempoEsecuzione = (float)(fine - inizio) / CLOCKS_PER_SEC;
  printf("Selection: %2.3f\n", tempoEsecuzione);

  inizio = clock();
  i_sort(array2, max);
  fine = clock();
  tempoEsecuzione = (float)(fine - inizio) / CLOCKS_PER_SEC;
  printf("Insertion: %2.3f\n", tempoEsecuzione);
  return 0;
}

The code contains references to a insertion sort function that I haven't included in the rest of the question because (as expected) java run slower that c.

解决方案

And this is what I really want to ask.

How is it possible for my program to run so fast if it has to write files and collect profiling information and instead it runs 1.5 times slower when it hasn't?

Yes, that's the real question here. Mentioning all that Java comparison stuff just adds noise.

I could reproduce the weird behavior on my machine with gcc 4.7.2. Not surprisingly, the hot path of the code is the inner for loop:

for (j = i + 1; j < size; j++) {
  if (a[j] < a[lowerElementIndex]) {
    lowerElementIndex = j;
}

The only relevant difference in the corresponding generated assembly code is:

Fast case:

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case:

cmpl    %ecx, %esi
cmovl   %edx, %edi
cmovl   %esi, %ecx

The first case (fast) can greatly benefit from branch prediction but the other (slow case) apparently cannot. Sorted or randomly shuffled arrays do not cause too much branch mispredictions. The first code snippet is optimal in that case.

As it turns out, it is actually difficult to create a dataset that causes a lot of branch mispredictions in selection sort. (It was pointed out by Yakk; see also my attempts to create an evil dataset; so far, I failed to create one.)

The -fprofile-arcs happens to disable tree vectorization which seems to be responsible for generating the slow case code. A better way to disable tree vectorization is to pass the -fno-tree-vectorize flag.

clang 3.4 also generates the fast case code, without any special flag. The Java code without warm up runs 2.4x slower than the C code. (Since it wasn't the question, I haven't looked into improving the Java code performance.)

这篇关于ç性能和编译选项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆