一起找到最小和最大:算法应该更快,但不是 [英] Find both min and max together: algorithm should be faster but isn't

查看:51
本文介绍了一起找到最小和最大:算法应该更快,但不是的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试实现一种算法,以在文件的一组long中找到最小值和最大值.我的测试文件包含10亿个long.

I'm trying to implement an algorithm to find the minimum and the maximum values among a set of longs in a file. My test file contains one billion longs.

该算法可以按预期工作,但执行速度不会比朴素的版本快.由于天真的版本执行大约2n次比较,因此它应该明显更快,而此版本执行3n/2次比较.

The algorithm works as expected but does not perform faster than the naive version. It should be significantly faster as the naive version performs roughly 2n comparisons, whereas this version performs 3n/2 comparisons.

$ time ./findminmax_naive somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m24.156s
user    0m4.956s
sys     0m3.896s

$ time ./findminmax_faster somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m25.048s
user    0m6.948s
sys     0m3.980s

这是天真"版本:

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL)
            perror("fopen");
        readcount = 1024;
        if (ac == 3)
            readcount = atol(av[2]);
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i++) {
                        count++;
                        if (n[i] < min)
                            min = n[i];
                        if (n[i] > max)
                            max = n[i];
                }
                if (feof(f))
                        break;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

以下是(可能是)更快"版本的代码:

Here is the code of the (should-be) "faster" version:

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL)
                perror("fopen");
        readcount = 1024;
        if (ac == 3)
                readcount = atol(av[2]);
        readcount = (readcount + 1) & (-1 << 1);
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i += 2) {
                        count += 2;
                        if (n[i] < n[i + 1]) {
                                if (n[i] < min)
                                        min = n[i];
                                if (n[i + 1] > max)
                                        max = n[i + 1];
                        } else {
                                if (n[i + 1] < min)
                                        min = n[i + 1];
                                if (n[i] > max)
                                        max = n[i];
                        }
                }
                if (feof(f))
                        break;
        }
        if (rlen % 2) {
                if (n[rlen - 1] < min)
                        min = n[rlen - 1];
                if (n[rlen - 1] > max)
                        max = n[rlen - 1];
                count++;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

你知道我错过了什么吗?

Do you have any idea what I missed?

感谢您的帮助.

-杰里米(Jeremie)

-- Jeremie

推荐答案

关键是分支预测.除非文件按照病理性最坏情况的顺序排序,否则,朴素的版本将执行2n次分支,几乎每次都可以正确预测该分支.您的聪明"版本执行的n/2个分支几乎永远不会正确预测,并且还会执行另外n个可能正确预测的比较.

The key is branch prediction. Unless the file is sorted in a pathological worst-case order, the naive version will perform 2n branches that are predicted correctly almost every single time. Your "clever" version performs n/2 branches that are almost never predicted correctly, and an additional n comparisons that are likely to be predicted correctly.

错误预测的分支成本多少很大程度上取决于cpu体系结构,甚至取决于特定的cpu模型,但是至少我希望错误预测的分支成本要比正确预测的分支成本高好几倍.在极端情况下,正确预测分支的有效成本可能为零周期.

How much wrongly-predicted branches cost depends a lot on the cpu architecture and even the particular cpu model, but at the very least I would expect an incorrectly predicted branch to cost several times as much as a correctly predicted one. In an extreme case, correctly-predicted branches might have an effective cost of zero cycles.

作为一个有趣的例子,我最近对优化strlen进行了实验,发现孤立地显示一个非常幼稚的strlen-一次比较并分支一个字节-比聪明的矢量化方法要快.这几乎可以肯定是因为strlen具有特殊的属性,即每个分支直到最后一个分支总是可以正确预测的.

As an interesting example, I recently experimented with optimizing strlen, and found that in isolation an extremely naive unrolled strlen - comparing and branching on one byte at a time - was faster than the clever vectorized approaches. This is almost surely because strlen has the special property that every branch until the last one will always be predicted correctly.

顺便说一句,要检验我的假设,请尝试以下输入模式:

By the way, to test my hypothesis, try this input pattern:

999999999、1000000001、999999998、1000000002、999999997、1000000003,...

999999999, 1000000001, 999999998, 1000000002, 999999997, 1000000003, ...

在您的聪明版本中,它将为朴素的算法提供最坏情况的分支预测,为外部条件给出最佳情况.

It will give worst-case branch prediction for the naive algorithm and best-case for the outer conditional on your clever version.

这篇关于一起找到最小和最大:算法应该更快,但不是的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆