Java High-load NIO TCP server

Question

As part of my research I'm writing a high-load TCP/IP echo server in Java. I want to serve about 3-4k clients and see the maximum possible messages per second that I can squeeze out of it. Message size is quite small: up to 100 bytes. This work doesn't have any practical purpose; it is purely research.

According to numerous presentations that I've seen (HornetQ benchmarks, LMAX Disruptor talks, etc.), real-world high-load systems tend to serve millions of transactions per second (I believe Disruptor mentioned about 6 million and HornetQ 8.5 million). For example, this post states that it is possible to achieve up to 40M MPS. So I took it as a rough estimate of what modern hardware should be capable of.

I wrote the simplest single-threaded NIO server and launched a load test. I was a little surprised that I can get only about 100k MPS on localhost and 25k with actual networking. The numbers look quite small. I was testing on Win7 x64, Core i7. Looking at CPU load, only one core is busy (which is expected for a single-threaded app), while the rest sit idle. However, even if I load all 8 cores (including virtual ones) I will have no more than 800k MPS, not even close to 40 million :)

My question is: what is a typical pattern for serving massive amounts of messages to clients? Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores? Or should I look towards using multiple Selectors in my NIO code? Or maybe even distribute the load between multiple JVMs and use Chronicle to build an inter-process communication between them? Will testing on a proper server-side OS like CentOS make a big difference (maybe it is Windows that slows things down)?

Below is the sample code of my server. It always answers with "ok" to any incoming data. I know that in the real world I'd need to track the size of the message and be prepared for one message being split between multiple reads, but I'd like to keep things super-simple for now.

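import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.SelectorProvider;
import java.util.Iterator;

public class EchoServer {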

private static final int BUFFER_SIZE = 1024;
private final static int DEFAULT_PORT = 9090;

// The buffer into which we'll read data when it's available
private ByteBuffer readBuffer = ByteBuffer.allocate(BUFFER_SIZE);

private InetAddress hostAddress = null;

private int port;
private Selector selector;

private long loopTime;
private long numMessages = 0;

public EchoServer() throws IOException {
    this(DEFAULT_PORT);
}

public EchoServer(int port) throws IOException {
    this.port = port;
    selector = initSelector();
    loop();
}

private void loop() {
    while (true) {
        try{
            selector.select();
            Iterator<SelectionKey> selectedKeys = selector.selectedKeys().iterator();
            while (selectedKeys.hasNext()) {
                SelectionKey key = selectedKeys.next();
                selectedKeys.remove();

                if (!key.isValid()) {
                    continue;
                }

                // Check what event is available and deal with it
                if (key.isAcceptable()) {
                    accept(key);
                } else if (key.isReadable()) {
                    read(key);
                } else if (key.isWritable()) {
                    write(key);
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}

private void accept(SelectionKey key) throws IOException {
    ServerSocketChannel serverSocketChannel = (ServerSocketChannel) key.channel();

    SocketChannel socketChannel = serverSocketChannel.accept();
    socketChannel.configureBlocking(false);
    socketChannel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
    socketChannel.setOption(StandardSocketOptions.TCP_NODELAY, true);
    socketChannel.register(selector, SelectionKey.OP_READ);

    System.out.println("Client is connected");
}

private void read(SelectionKey key) throws IOException {
    SocketChannel socketChannel = (SocketChannel) key.channel();

    // Clear out our read buffer so it's ready for new data
    readBuffer.clear();

    // Attempt to read off the channel
    int numRead;
    try {
        numRead = socketChannel.read(readBuffer);
    } catch (IOException e) {
        key.cancel();
        socketChannel.close();

        System.out.println("Forceful shutdown");
        return;
    }

    if (numRead == -1) {
        System.out.println("Graceful shutdown");
        key.channel().close();
        key.cancel();

        return;
    }

    socketChannel.register(selector, SelectionKey.OP_WRITE);

    numMessages++;
    if (numMessages%100000 == 0) {
        long elapsed = System.currentTimeMillis() - loopTime;
        loopTime = System.currentTimeMillis();
        System.out.println(elapsed);
    }
}

private void write(SelectionKey key) throws IOException {
    SocketChannel socketChannel = (SocketChannel) key.channel();
    ByteBuffer dummyResponse = ByteBuffer.wrap("ok".getBytes("UTF-8"));

    socketChannel.write(dummyResponse);
    if (dummyResponse.remaining() > 0) {
        System.err.print("Filled UP");
    }

    key.interestOps(SelectionKey.OP_READ);
}

private Selector initSelector() throws IOException {
    Selector socketSelector = SelectorProvider.provider().openSelector();

    ServerSocketChannel serverChannel = ServerSocketChannel.open();
    serverChannel.configureBlocking(false);

    InetSocketAddress isa = new InetSocketAddress(hostAddress, port);
    serverChannel.socket().bind(isa);
    serverChannel.register(socketSelector, SelectionKey.OP_ACCEPT);
    return socketSelector;
}

public static void main(String[] args) throws IOException {
    System.out.println("Starting echo server");
    new EchoServer();
}
}


Recommended answer

what is a typical pattern for serving massive amounts of messages to clients?

There are many possible patterns. An easy way to utilize all cores without going through multiple JVMs is:


  1. Have a single thread accept connections and read using a selector.
  2. Once you have enough bytes to constitute a single message, pass it on to another core using a construct like a ring buffer. The Disruptor Java framework is a good match for this. This is a good pattern if the processing needed to know what a complete message is stays lightweight. For example, if you have a length-prefixed protocol you could wait until you get the expected number of bytes and then send it to another thread. If the parsing of the protocol is very heavy, you might overwhelm this single thread, preventing it from accepting connections or reading bytes off the network.
  3. On your worker thread(s), which receive data from a ring buffer, do the actual processing.
  4. You write out the responses either on your worker threads or through some other aggregator thread.
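The hand-off in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the Disruptor API: it uses the JDK's `ArrayBlockingQueue` as a stand-in for a ring buffer, and the names `HandOffPipeline`, `publish` and `shutdownAndCount` are hypothetical. The "processing" is just a byte count so the example stays self-contained.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical hand-off stage: the I/O thread publishes complete messages and
// one worker thread consumes them. ArrayBlockingQueue stands in for a ring buffer.
final class HandOffPipeline {
    private static final byte[] POISON = new byte[0]; // sentinel that stops the worker

    private final BlockingQueue<byte[]> queue;
    private final AtomicLong processedBytes = new AtomicLong();
    private final Thread worker;

    HandOffPipeline(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        worker = new Thread(this::drain, "worker");
        worker.start();
    }

    // Called by the I/O thread once a full message has been assembled.
    // put() blocks when the queue is full, which pushes back on the reader.
    void publish(byte[] message) {
        enqueue(message);
    }

    // Worker loop: take messages until the sentinel instance arrives.
    private void drain() {
        try {
            for (byte[] msg = queue.take(); msg != POISON; msg = queue.take()) {
                processedBytes.addAndGet(msg.length); // placeholder for real processing
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stops the worker after everything already queued has been processed
    // and returns the total number of bytes it saw.
    long shutdownAndCount() {
        enqueue(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processedBytes.get();
    }

    private void enqueue(byte[] item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while publishing", e);
        }
    }
}
```

A lock-based queue like this will not match a purpose-built ring buffer under heavy load, but it has the same shape: a bounded buffer between one reader and the workers.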

That's the gist of it. There are many more possibilities here and the answer really depends on the type of application you are writing. A few examples are:


  1. A CPU-heavy stateless application, say an image processing application. The amount of CPU/GPU work done per request will probably be significantly higher than the overhead generated by a very naive inter-thread communication solution. In this case an easy solution is a bunch of worker threads pulling work from a single queue. Notice how this is a single queue instead of one queue per worker. The advantage is that this is inherently load balanced. Each worker finishes its work and then just polls the single-producer multiple-consumer queue. Even though this is a source of contention, the image-processing work (seconds?) should be far more expensive than any synchronization alternative.
  2. A pure IO application, e.g. a stats server which just increments some counters for a request: Here you do almost no CPU-heavy work. Most of the work is just reading bytes and writing bytes. A multi-threaded application might not give you significant benefit here. In fact it might even slow things down if the time it takes to queue items is more than the time it takes to process them. A single-threaded Java server should be able to saturate a 1G link easily.
  3. Stateful applications which require moderate amounts of processing, e.g. a typical business application: Here every client has some state that determines how each request is handled. Assuming we go multi-threaded since the processing is non-trivial, we could affinitize clients to certain threads. This is a variant of the actor architecture:

i) When a client first connects, hash it to a worker. You might want to do this with some client id, so that if it disconnects and reconnects it is still assigned to the same worker/actor.

ii) When the reader thread reads a complete request, put it on the ring buffer for the right worker/actor. Since the same worker always processes a particular client, all the state should be thread local, making all the processing logic simple and single-threaded.

iii) The worker thread can write responses out. Always attempt to just do a write() first. Only if all your data could not be written out do you register for OP_WRITE. The worker thread only needs to make select calls if there is actually something outstanding. Most writes should just succeed, making this unnecessary. The trick here is balancing between select calls and polling the ring buffer for more requests. You could also employ a single writer thread whose only responsibility is to write responses out. Each worker thread can put its responses on a ring buffer connecting it to this single writer thread. The single writer thread round-robin polls each incoming ring buffer and writes out the data to clients. Again, the caveat about trying a write before select applies, as does the trick about balancing between multiple ring buffers and select calls.
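The "write first, register for OP_WRITE only on a partial write" advice from iii) might look like the sketch below. The class name `WriteFirst`, its two methods, and the choice to park the unwritten tail in the key's attachment are assumptions made for illustration; only the `java.nio` calls themselves are real API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Hypothetical helper showing the optimistic write path on a non-blocking channel.
final class WriteFirst {
    private WriteFirst() { }

    // Tries to write immediately. Only if the socket buffer could not take
    // everything is OP_WRITE added to the key's interest set.
    // Returns true when the whole response went out in this call.
    static boolean send(SelectionKey key, ByteBuffer response) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        channel.write(response);
        if (response.hasRemaining()) {
            key.attach(response); // assumption: the attachment holds the unwritten tail
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
            return false;
        }
        return true;
    }

    // Called from the select loop when the key reports writable.
    // Drops OP_WRITE again as soon as nothing is outstanding.
    static void onWritable(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer pending = (ByteBuffer) key.attachment();
        if (pending != null) {
            channel.write(pending);
        }
        if (pending == null || !pending.hasRemaining()) {
            key.attach(null);
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }
}
```

With small responses like the "ok" reply in the question, `send` should almost always finish in one call, so the selector is rarely asked about writability at all. A real server would also need to queue further responses that arrive while a tail is still pending, which this sketch leaves out.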

As you point out there are many other options:

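Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores?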

You can do this, but IMHO this is not the best use for a load balancer. This does buy you independent JVMs that can fail on their own, but it will probably be slower than writing a single JVM app that is multi-threaded. The application itself might be easier to write though, since it will be single-threaded.

Or I should look towards using multiple Selectors in my NIO code?

You can do this too. Look at the Nginx architecture for some hints on how to do this.
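One way the multiple-selector idea could be laid out is sketched below, loosely in the spirit of one event loop per core. Everything here (`SelectorLoop`, `SelectorGroup`, `adopt`, `next`, `shutdown`) is a hypothetical naming, and the echo handler ignores partial writes to stay short. New channels are queued and registered on the loop's own thread, so `register` is never called from another thread while that selector is blocked in `select()`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// One event loop: its own Selector, its own buffer, its own thread.
final class SelectorLoop implements Runnable {
    private final Selector selector;
    private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();
    private final ByteBuffer buffer = ByteBuffer.allocate(1024);
    private volatile boolean running = true;

    SelectorLoop() {
        try {
            selector = Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called by the acceptor thread; the register() call itself runs on the loop thread.
    void adopt(SocketChannel channel) {
        pending.add(channel);
        selector.wakeup();
    }

    void stop() {
        running = false;
        selector.wakeup();
    }

    @Override
    public void run() {
        try {
            while (running) {
                selector.select();
                for (SocketChannel ch; (ch = pending.poll()) != null; ) {
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                }
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isValid() && key.isReadable()) {
                        echo(key);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try { selector.close(); } catch (IOException ignored) { }
        }
    }

    // Reads whatever is available and writes it straight back (sketch: no partial-write handling).
    private void echo(SelectionKey key) {
        SocketChannel ch = (SocketChannel) key.channel();
        try {
            buffer.clear();
            if (ch.read(buffer) < 0) {
                ch.close(); // closing the channel also cancels its key
                return;
            }
            buffer.flip();
            ch.write(buffer);
        } catch (IOException e) {
            try { ch.close(); } catch (IOException ignored) { }
        }
    }
}

// Starts N loops and hands them out round-robin to the acceptor thread.
final class SelectorGroup {
    private final SelectorLoop[] loops;
    private final Thread[] threads;
    private int next; // touched only by the acceptor thread

    SelectorGroup(int size) {
        loops = new SelectorLoop[size];
        threads = new Thread[size];
        for (int i = 0; i < size; i++) {
            loops[i] = new SelectorLoop();
            threads[i] = new Thread(loops[i], "selector-" + i);
            threads[i].start();
        }
    }

    SelectorLoop next() {
        SelectorLoop loop = loops[next];
        next = (next + 1) % loops.length;
        return loop;
    }

    void shutdown() {
        for (SelectorLoop loop : loops) loop.stop();
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }
}
```

In the question's server, `accept` would then call something like `group.next().adopt(socketChannel)` instead of registering the channel with the single shared selector. Whether this helps for a pure echo workload is something to measure rather than assume.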

Or maybe even distribute the load between multiple JVMs and use Chronicle to build an inter-process communication between them?

This is also an option. Chronicle gives you the advantage that memory-mapped files are more resilient to a process quitting in the middle. You still get plenty of performance since all communication is done through shared memory.

Will testing on a proper serverside OS like CentOS make a big difference (maybe it is Windows that slows things down)?

I don't know about this. Unlikely. If Java uses the native Windows APIs to the fullest, it shouldn't matter as much. I am highly doubtful of the 40 million transactions/sec figure (without a user-space networking stack + UDP), but the architectures I listed should do pretty well.

These architectures tend to do well since they are single-writer architectures that use bounded, array-based data structures for inter-thread communication. Determine if multi-threading is even the answer. In many cases it is not needed and can lead to a slowdown.
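To make "single-writer, bounded, array-based" concrete, here is a minimal single-producer single-consumer ring buffer. It is a teaching sketch under stated assumptions (exactly one producer thread, exactly one consumer thread, power-of-two capacity), not the Disruptor's implementation; the class name `SpscRingBuffer` is made up for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Bounded SPSC queue over a plain array. Each counter has exactly one writer:
// only the producer advances 'tail', only the consumer advances 'head'.
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next sequence to read
    private final AtomicLong tail = new AtomicLong(); // next sequence to write

    SpscRingBuffer(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new Object[capacity];
        mask = capacity - 1;
    }

    // Producer side. Returns false instead of blocking when the buffer is full.
    boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = item;
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    // Consumer side. Returns null when the buffer is empty.
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        T item = (T) slots[index];
        slots[index] = null; // drop the reference so it can be collected
        head.set(h + 1);     // volatile write frees the slot for the producer
        return item;
    }
}
```

Because each counter is written by one thread only, no compare-and-set loop or lock is needed. Production implementations typically add cache-line padding and cheaper ordered writes on top of this idea.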

Another area to look into is memory allocation schemes. Specifically, the strategy to allocate and reuse buffers could lead to significant benefits. The right buffer reuse strategy depends on the application. Look at schemes like buddy memory allocation, arena allocation, etc. to see if they can benefit you. The JVM GC does plenty fine for most workloads though, so always measure before you go down this route.
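As a much simpler relative of the buddy and arena schemes mentioned above, a fixed-size buffer pool might look like this. It is only a sketch: `BufferPool`, `acquire` and `release` are hypothetical names, the pool is deliberately not thread-safe (the assumption is one pool per I/O thread), and direct buffers are just one possible choice.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical per-thread pool of equally sized buffers.
final class BufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;

    BufferPool(int bufferSize, int maxPooled) {
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    // Reuses a pooled buffer when one is available, otherwise allocates a new one.
    ByteBuffer acquire() {
        ByteBuffer buffer = free.pollFirst();
        return buffer != null ? buffer : ByteBuffer.allocateDirect(bufferSize);
    }

    // Returns a buffer to the pool. Foreign-sized buffers and overflow beyond
    // maxPooled are simply dropped and left to the garbage collector.
    void release(ByteBuffer buffer) {
        if (buffer.capacity() != bufferSize || free.size() >= maxPooled) {
            return;
        }
        buffer.clear(); // reset position and limit; the old bytes are not zeroed
        free.addFirst(buffer); // most recently used first, which tends to be cache-friendlier
    }
}
```

A typical use would be `ByteBuffer buf = pool.acquire();` before a read and `pool.release(buf);` once the response has been written. As the answer says, it is worth measuring before assuming this beats plain allocation.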

Protocol design has a big effect on performance too. I tend to prefer length-prefixed protocols because they let you allocate buffers of the right size, avoiding lists of buffers and/or buffer merging. Length-prefixed protocols also make it easy to decide when to hand over a request: just check num bytes == expected. The actual parsing can be done by the worker threads. Serialization and deserialization extend beyond length-prefixed protocols. Patterns like flyweights over buffers instead of allocations help here. Look at SBE for some of these principles.
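The "num bytes == expected" check can be sketched as a small incremental decoder. The wire format here is an assumption made for the example (a 4-byte big-endian length followed by the payload), and `LengthPrefixedDecoder`, `feed` and `maxFrameBytes` are hypothetical names. The decoder copes with a frame arriving split across several reads, which the question's server explicitly skips.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Incremental decoder for frames of the assumed form: [int length][length bytes of payload].
final class LengthPrefixedDecoder {
    private static final int HEADER_BYTES = 4;

    private final int maxFrameBytes;
    private final ByteBuffer header = ByteBuffer.allocate(HEADER_BYTES);
    private ByteBuffer body; // null while the length header is still incomplete

    LengthPrefixedDecoder(int maxFrameBytes) {
        this.maxFrameBytes = maxFrameBytes;
    }

    // Consumes everything readable in 'in' and returns the frames it completed.
    // Leftover partial state is kept for the next call.
    List<byte[]> feed(ByteBuffer in) {
        List<byte[]> frames = new ArrayList<>();
        while (in.hasRemaining()) {
            if (body == null) {
                copy(in, header);
                if (header.hasRemaining()) {
                    break; // header itself is split across reads
                }
                header.flip();
                int length = header.getInt();
                header.clear();
                if (length < 0 || length > maxFrameBytes) {
                    throw new IllegalStateException("bad frame length " + length);
                }
                body = ByteBuffer.allocate(length); // exactly the right size, no merging later
            }
            copy(in, body);
            if (body.hasRemaining()) {
                break; // num bytes != expected yet
            }
            frames.add(body.array()); // complete: ready to hand over to a worker
            body = null;
        }
        return frames;
    }

    // Copies as many bytes as both buffers allow.
    private static void copy(ByteBuffer src, ByteBuffer dst) {
        int n = Math.min(src.remaining(), dst.remaining());
        for (int i = 0; i < n; i++) {
            dst.put(src.get());
        }
    }
}
```

One decoder instance per connection would typically live in the `SelectionKey` attachment. The `maxFrameBytes` guard is there so a corrupt or hostile length field cannot trigger a huge allocation.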

As you can imagine, an entire treatise could be written here. This should set you in the right direction. Warning: always measure, and make sure you need more performance than the simplest option gives you. It's easy to get sucked into a never-ending black hole of performance improvements.
