Fastest way to check if a byte array is all zeros


Question


I have a byte[4096] and was wondering what the fastest way is to check whether all of its values are zero.

Is there anything faster than doing this?

byte[] b = new byte[4096];
b[4095] = 1;
for(int i=0;i<b.length;i++)
    if(b[i] != 0)
        return false; // Not Empty

Solution

I have rewritten this answer. I was originally summing all bytes, which is incorrect because Java bytes are signed (positive and negative values can cancel each other out), so I need to OR them instead. I have also corrected the JVM warm-up.
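
To illustrate why summing goes wrong with signed bytes, here is a minimal sketch. The class and method names (SumVersusOr, allZeroBySum, allZeroByOr) are made up for this illustration and are not part of the benchmark further down.

```java
final class SumVersusOr {
    // Flawed check: signed bytes can cancel each other out in a sum.
    static boolean allZeroBySum(byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum += b;
        }
        return sum == 0;
    }

    // OR-based check: any set bit in any byte survives in the accumulator.
    static boolean allZeroByOr(byte[] array) {
        int acc = 0;
        for (byte b : array) {
            acc |= b;
        }
        return acc == 0;
    }

    public static void main(String[] args) {
        byte[] cancelling = {1, -1};
        System.out.println(allZeroBySum(cancelling)); // prints "true", which is wrong
        System.out.println(allZeroByOr(cancelling));  // prints "false", which is right
    }
}
```

The array {1, -1} is clearly not all zeros, yet its sum is 0, while the OR of its elements is non-zero.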

Your best bet really is to simply loop over all values.

I suppose you have three major options available:

  1. OR all elements together and check the result.
  2. Do branchless comparisons.
  3. Do comparisons with a branch.

I don't know how well Java performs at the low level when combining bytes this way. I do know that the compiled code benefits from the CPU's branch predictor when you give it branching comparisons.

Therefore I expect the following to happen with this loop:

byte[] array = new byte[4096];
for (byte b : array) {
    if (b != 0) {
        return false;
    }
}

  1. Relatively slow comparison in the first few iterations when the branch predictor is still seeding itself.
  2. Very fast branch comparisons due to branch prediction as every value should be zero anyway.

If the loop hits a non-zero value, the branch predictor mispredicts and that comparison is slow, but at that point the computation is over anyway, since you return false either way. I think the cost of one mispredicted branch is an order of magnitude smaller than the cost of continuing to iterate over the array.

I furthermore believe that for (byte b : array) is fine to use, as it should be compiled directly into indexed array iteration. As far as I know there is no such thing as a PrimitiveArrayIterator, which would cause extra method calls (as when iterating over a list) until the code gets inlined.
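
As a rough illustration: the Java Language Specification defines the enhanced for statement over an array in terms of an indexed loop, so the two methods in this sketch should behave identically. The names (LoopForms, forEachForm, indexedForm) are hypothetical and only serve this comparison.

```java
final class LoopForms {
    // Enhanced for loop over a primitive array; no Iterator object is involved.
    static boolean forEachForm(byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    // The hand-written indexed form the enhanced loop corresponds to.
    static boolean indexedForm(byte[] array) {
        for (int i = 0; i < array.length; i++) {
            byte b = array[i];
            if (b != 0) {
                return false;
            }
        }
        return true;
    }
}
```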

Update

I wrote my own benchmarks, which gave some interesting results. Unfortunately I couldn't use any of the existing benchmark tools, as they are pretty hard to set up correctly.

I also decided to group options 1 and 2 together, as I think they are effectively the same: a branchless check usually ORs everything together (without the condition) and then tests the final result. The condition here is x != 0, so ORing in a zero is presumably a no-op.

The code:

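import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class Benchmark {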
    private void start() {
        //setup byte arrays
        List<byte[]> arrays = createByteArrays(700_000);

        //warmup and benchmark repeated
        arrays.forEach(this::byteArrayCheck12);
        benchmark(arrays, this::byteArrayCheck12, "byteArrayCheck12");

        arrays.forEach(this::byteArrayCheck3);
        benchmark(arrays, this::byteArrayCheck3, "byteArrayCheck3");

        arrays.forEach(this::byteArrayCheck4);
        benchmark(arrays, this::byteArrayCheck4, "byteArrayCheck4");

        arrays.forEach(this::byteArrayCheck5);
        benchmark(arrays, this::byteArrayCheck5, "byteArrayCheck5");
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck12(final byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum |= b;
        }
        return (sum == 0);
    }

    private boolean byteArrayCheck3(final byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

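    private boolean byteArrayCheck4(final byte[] array) {
        // OR all bytes together via a stream; true only when every byte is zero
        return (IntStream.range(0, array.length).map(i -> array[i]).reduce(0, (a, b) -> a | b) == 0);
    }

    private boolean byteArrayCheck5(final byte[] array) {
        // short-circuits on the first non-zero byte; true only when every byte is zero
        return IntStream.range(0, array.length).map(i -> array[i]).noneMatch(i -> i != 0);
    }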

    public static void main(String[] args) {
        new Benchmark().start();
    }
}

The surprising results:

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 50.18817142857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 767.7371985714286ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21145.03219857143ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10376.119144285714ns

This shows that ORing is a whole lot faster than relying on the branch predictor, which is rather surprising, so I assume some low-level optimizations are being done.

As an extra I've included the stream variants, which I did not expect to be fast anyway.

Ran on a stock-clocked Intel i7-3770, 16GB 1600MHz RAM.

So I think the final answer is: it depends. It depends on how many times you are going to check the array consecutively. The byteArrayCheck3 solution stays steady at 700 to 800 ns.

Follow up update

Things take another interesting turn: it turns out the JIT was optimizing almost all of the calculations away, because the resulting values were never used.

Thus I have the following new benchmark method:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

This ensures that the results of the benchmarked methods cannot be optimized away. The major issue was that the return value of byteArrayCheck12 was being discarded: the JIT noticed that (sum == 0) was never used and optimized away the entire method.

Thus we have the following new results (result prints omitted for clarity):

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1370.6987942857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 736.1096242857143ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 20671.230327142857ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 9845.388841428572ns

Hence it looks like we can finally conclude that branch prediction wins. The difference could also be due to the early return, though, since on average the offending byte sits in the middle of the byte array. So it is time for another method that does not return early:

private boolean byteArrayCheck3b(final byte[] array) {
    int hits = 0;
    for (byte b : array) {
        if (b != 0) {
            hits++;
        }
    }
    return (hits == 0);
}

This way we still benefit from branch prediction, but we make sure that we cannot return early.

Which in turn gives us more interesting results again!

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1327.2817714285713ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 753.31376ns
Benchmark: byteArrayCheck3b / iterations: 700000 / time per iteration: 1506.6772842857142ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21655.950115714284ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10608.70917857143ns

I think we can finally conclude that the fastest way is to use both early return and branch prediction, followed by ORing, followed by pure branch prediction. I suspect that all of those operations are highly optimized in native code.

Update: some additional benchmarking using long and int arrays.

After seeing suggestions to use long[] and int[], I decided it was worth investigating. These attempts may no longer be fully in line with the original answer, but they may still be interesting.

Firstly, I changed the benchmark method to use generics:

private <T> void benchmark(final List<T> arrays, final Predicate<T> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (T array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

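Then I converted the byte[] arrays to long[] and int[] respectively before running the benchmarks. It was also necessary to set the maximum heap size to 10 GB (each list of 700,000 arrays of 4096 bytes holds roughly 2.9 GB of data, and the byte, long, and int lists are all kept in memory).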

List<long[]> longArrays = arrays.stream().map(byteArray -> {
    long[] longArray = new long[4096 / 8];
    ByteBuffer.wrap(byteArray).asLongBuffer().get(longArray);
    return longArray;
}).collect(Collectors.toList());
longArrays.forEach(this::byteArrayCheck8);
benchmark(longArrays, this::byteArrayCheck8, "byteArrayCheck8");

List<int[]> intArrays = arrays.stream().map(byteArray -> {
    int[] intArray = new int[4096 / 4];
    ByteBuffer.wrap(byteArray).asIntBuffer().get(intArray);
    return intArray;
}).collect(Collectors.toList());
intArrays.forEach(this::byteArrayCheck9);
benchmark(intArrays, this::byteArrayCheck9, "byteArrayCheck9");

private boolean byteArrayCheck8(final long[] array) {
    for (long l : array) {
        if (l != 0) {
            return false;
        }
    }
    return true;
}

private boolean byteArrayCheck9(final int[] array) {
    for (int i : array) {
        if (i != 0) {
            return false;
        }
    }
    return true;
}

Which gave the following results:

Benchmark: byteArrayCheck8 / iterations: 700000 / time per iteration: 259.8157614285714ns
Benchmark: byteArrayCheck9 / iterations: 700000 / time per iteration: 266.38013714285717ns

This path may be worth exploring if it is possible to get the bytes in such a format. However, when the conversions were done inside the benchmarked method, the times were around 2000 nanoseconds per iteration, so it is not worth it when you need to do the conversions yourself.
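
One way to look at eight bytes at a time without first copying into a long[] might be to read long values straight out of the byte[] through a ByteBuffer wrapper. This is only a sketch under that assumption: the class name WideZeroCheck is made up, and this variant was not part of the benchmarks above, so its speed is untested here.

```java
import java.nio.ByteBuffer;

final class WideZeroCheck {
    // Returns true only when every byte in the array is zero.
    static boolean isAllZero(byte[] array) {
        // wrap() does not copy; the buffer reads directly from the array
        ByteBuffer buffer = ByteBuffer.wrap(array);
        int wideLimit = array.length - (array.length % Long.BYTES);
        int i = 0;
        // Compare eight bytes per step, returning early on the first non-zero word.
        for (; i < wideLimit; i += Long.BYTES) {
            if (buffer.getLong(i) != 0L) {
                return false;
            }
        }
        // Check the remaining (at most seven) bytes one at a time.
        for (; i < array.length; i++) {
            if (array[i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

Byte order does not matter for a zero check, so the buffer's default order is left unchanged. Whether this beats the plain byte loop would have to be measured with the same kind of benchmark as above.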
