Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()


Problem Description

I'm working on some SocketChannel-to-SocketChannel code which will do best with a direct byte buffer--long lived and large (tens to hundreds of megabytes per connection.) While hashing out the exact loop structure with FileChannels, I ran some micro-benchmarks on ByteBuffer.allocate() vs. ByteBuffer.allocateDirect() performance.

There was a surprise in the results that I can't really explain. In the graph below, there is a very pronounced cliff at 256KB and 512KB for the ByteBuffer.allocate() transfer implementation: the performance drops by ~50%! There also seems to be a smaller performance cliff for ByteBuffer.allocateDirect(). (The %-gain series helps to visualize these changes.)

[Graph: buffer size (bytes) vs. time (ms)]

Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()? What exactly is going on behind the curtain?

It may very well be hardware- and OS-dependent, so here are those details:


  • MacBook Pro with a dual-core Core 2 CPU

  • Intel X25M SSD drive

  • OSX 10.6.4

Source code, by request:

package ch.dietpizza.bench;

import static java.lang.String.format;
import static java.lang.System.out;
import static java.nio.ByteBuffer.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class SocketChannelByteBufferExample {
    private static WritableByteChannel target;
    private static ReadableByteChannel source;
    private static ByteBuffer          buffer;

    public static void main(String[] args) throws IOException, InterruptedException {
        long timeDirect;
        long normal;
        out.println("start");

        for (int i = 512; i <= 1024 * 1024 * 64; i *= 2) {
            buffer = allocateDirect(i);
            timeDirect = copyShortest();

            buffer = allocate(i);
            normal = copyShortest();

            out.println(format("%d, %d, %d", i, normal, timeDirect));
        }

        out.println("stop");
    }

    private static long copyShortest() throws IOException, InterruptedException {
        int result = 0;
        for (int i = 0; i < 100; i++) {
            int single = copyOnce();
            result = (i == 0) ? single : Math.min(result, single);
        }
        return result;
    }


    private static int copyOnce() throws IOException, InterruptedException {
        initialize();

        long start = System.currentTimeMillis();

        while (source.read(buffer) != -1) {
            buffer.flip();
            target.write(buffer);
            buffer.clear();  //pos = 0, limit = capacity
        }

        long time = System.currentTimeMillis() - start;

        rest();

        return (int)time;
    }   


    private static void initialize() throws UnknownHostException, IOException {
        InputStream  is = new FileInputStream(new File("/Users/stu/temp/robyn.in"));//315 MB file
        OutputStream os = new FileOutputStream(new File("/dev/null"));

        target = Channels.newChannel(os);
        source = Channels.newChannel(is);
    }

    private static void rest() throws InterruptedException {
        System.gc();
        Thread.sleep(200);      
    }
}


Recommended Answer

How ByteBuffer works, and why direct (byte) buffers are the only ones truly useful now.



First off, I am a bit surprised this is not common knowledge, but bear with me.

Direct byte buffers allocate an address outside the Java heap.

This is of utmost importance: all OS (and native C) functions can use that address without locking the object on the heap and copying the data. A short example of the copying: in order to send any data via Socket.getOutputStream().write(byte[]), the native code has to "lock" the byte[], copy it outside the Java heap, and then call the OS function, e.g. send. The copy is performed either on the stack (for smaller byte[]) or via malloc/free for larger ones. DatagramSockets are no different and they also copy, except that they are limited to 64KB and allocated on the stack, which can even kill the process if the thread stack is not large enough or is deep in recursion. Note: locking prevents the JVM/GC from moving/reallocating the object around the heap.

So with the introduction of NIO, the idea was to avoid the copying and the multitudes of stream pipelining/indirection. Often there are 3-4 buffered types of streams before the data reaches its destination. (yay, Poland equalizes(!) with a beautiful shot) By introducing direct buffers, Java can communicate straight with C native code with no locking/copying necessary. Hence the send function can take the address of the buffer plus the position, and the performance is much the same as native C. That's about it for direct buffers.
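To make the direct path concrete, here is a minimal, self-contained sketch (not from the question's benchmark; a FileChannel is used so it needs no network). The write hands the kernel the address of the native memory backing the buffer; no intermediate heap copy is involved:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectWriteSketch {
    public static void main(String[] args) throws IOException {
        // Direct buffer: memory lives outside the Java heap, so the OS
        // write can use its address directly -- no lock-and-copy step.
        ByteBuffer direct = ByteBuffer.allocateDirect(8192);
        while (direct.hasRemaining()) direct.put((byte) 'x');
        direct.flip();

        Path tmp = Files.createTempFile("direct", ".bin");
        try (FileChannel out = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            int written = out.write(direct); // kernel reads the native memory in place
            System.out.println(written);     // 8192
        }
        Files.delete(tmp);
    }
}
```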

The main issue with direct buffers: they are expensive to allocate, expensive to deallocate, and quite cumbersome to use, nothing like byte[].

Non-direct buffers do not offer the true essence that direct buffers do (i.e. a direct bridge to the native/OS side); instead they are lightweight and share exactly the same API. Even more, they can wrap a byte[], and their backing array is available for direct manipulation. What's not to love? Well, they have to be copied!
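Those heap-buffer conveniences look like this in practice (a small sketch, not from the answer):

```java
import java.nio.ByteBuffer;

public class HeapBufferSketch {
    public static void main(String[] args) {
        byte[] raw = new byte[] {1, 2, 3, 4};

        // A non-direct buffer can simply wrap an existing byte[] ...
        ByteBuffer buf = ByteBuffer.wrap(raw);
        System.out.println(buf.hasArray());     // true
        System.out.println(buf.array() == raw); // true: same array, no copy

        // ... and the backing array stays directly manipulable.
        raw[0] = 42;
        System.out.println(buf.get(0));         // 42
    }
}
```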

So how do Sun/Oracle handle non-direct buffers, since the OS/native side can't use them? Well, naively. When a non-direct buffer is used, a direct counterpart has to be created. The implementation is smart enough to use a ThreadLocal and cache a few direct buffers via SoftReference* to avoid the hefty cost of creation. The naive part comes when copying them: it attempts to copy the entire buffer (remaining()) each time.
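A rough sketch of that hidden path follows. The helper name and caching here are simplified illustrations of what the answer describes, not the actual JDK internals (the real implementation lives in sun.nio.ch and also uses SoftReferences):

```java
import java.nio.ByteBuffer;

public class TempDirectCopySketch {
    // Illustrative stand-in for the JDK's per-thread cache of
    // temporary direct buffers.
    private static final ThreadLocal<ByteBuffer> CACHE = new ThreadLocal<>();

    static ByteBuffer getTemporaryDirectBuffer(int size) {
        ByteBuffer cached = CACHE.get();
        if (cached == null || cached.capacity() < size) {
            cached = ByteBuffer.allocateDirect(size);
            CACHE.set(cached);
        }
        cached.clear();
        cached.limit(size);
        return cached;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(512 * 1024);

        // What a write of a heap buffer boils down to: copy ALL remaining
        // bytes into a direct buffer first, regardless of how many bytes
        // the target channel will actually accept.
        ByteBuffer direct = getTemporaryDirectBuffer(heap.remaining());
        direct.put(heap); // full 512 KB memcpy on every write call
        direct.flip();
        System.out.println(direct.remaining()); // 524288
    }
}
```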

Now imagine: a 512 KB non-direct buffer going to a 64 KB socket buffer; the socket buffer won't take more than its size. So the 1st time, 512 KB will be copied from non-direct to thread-local-direct, but only 64 KB of it will be used. The next time 512-64 KB will be copied but only 64 KB used, and the third time 512-64*2 KB will be copied but only 64 KB will be used, and so on... and that is optimistically assuming the socket buffer is always entirely empty. So in total you are copying not n KB, but n × n ÷ m KB (n = 512, m = 16, the average space the socket buffer has left).
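The arithmetic can be checked with a few lines (assuming, more charitably than the answer's m = 16, that the socket always drains a full 64 KB per write):

```java
public class CopyCostSketch {
    public static void main(String[] args) {
        int n = 512;    // heap buffer payload, in KB
        int chunk = 64; // KB the socket accepts per write

        long totalCopiedKB = 0;
        for (int remaining = n; remaining > 0; remaining -= chunk) {
            // Each write() re-copies everything still remaining() in the
            // heap buffer into the temporary direct buffer.
            totalCopiedKB += remaining;
        }
        // 512 + 448 + ... + 64 = 2304 KB copied to push 512 KB of payload.
        System.out.println(totalCopiedKB);
    }
}
```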

The copying part is a common/abstract path for all non-direct buffers, so the implementation never knows the target capacity. Copying trashes the caches and whatnot, reduces memory bandwidth, etc.

* A note about the SoftReference caching: it depends on the GC implementation, and experiences can vary. Sun's GC uses the free heap memory to determine the lifespan of SoftReferences, which leads to some awkward behavior when they are freed: the application needs to allocate the previously cached objects again, i.e. more allocation. (Direct ByteBuffers take only a minor part in the heap, so at least they do not contribute to the extra cache trashing, but are affected by it instead.)

My rule of thumb: use a pooled direct buffer sized to the socket read/write buffer. The OS never copies more than necessary.
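A sketch of that rule of thumb (assuming, as is the case on common platforms, that an open but unconnected socket still reports its default SO_SNDBUF):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class SizedDirectBufferSketch {
    public static void main(String[] args) throws IOException {
        try (SocketChannel ch = SocketChannel.open()) {
            // Size the (reused, long-lived) direct buffer to the socket's
            // actual send buffer, so the OS never has to copy more than
            // it can take in one go.
            int sndBuf = ch.socket().getSendBufferSize(); // SO_SNDBUF
            ByteBuffer buf = ByteBuffer.allocateDirect(sndBuf);
            System.out.println(buf.isDirect() && buf.capacity() == sndBuf);
        }
    }
}
```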

This micro-benchmark is mostly a memory-throughput test; the OS will have the file entirely in cache, so it mostly tests memcpy. Once the buffers run out of the L2 cache, the drop in performance becomes noticeable. Also, running the benchmark like that imposes increasing, accumulating GC collection costs. (rest() will not collect the soft-referenced ByteBuffers.)

