close() is not closing socket properly

Problem description

I have a multi-threaded server (thread pool) that is handling a large number of requests (up to 500/sec for one node), using 20 threads. There's a listener thread that accepts incoming connections and queues them for the handler threads to process. Once the response is ready, the threads then write out to the client and close the socket. All seemed to be fine until recently, a test client program started hanging randomly after reading the response. After a lot of digging, it seems that the close() from the server is not actually disconnecting the socket. I've added some debugging prints to the code with the file descriptor number and I get this type of output.

Processing request for 21
Writing to 21
Closing 21

The return value of close() is 0, or there would be another debug statement printed. After this output with a client that hangs, lsof is showing an established connection.

SERVER 8160 root 21u IPv4 32754237 TCP localhost:9980->localhost:47530 (ESTABLISHED)

CLIENT 17747 root 12u IPv4 32754228 TCP localhost:47530->localhost:9980 (ESTABLISHED)

It's as if the server never sends the shutdown sequence to the client. This state persists until the client is killed, which leaves the server side in the CLOSE_WAIT state:

SERVER 8160 root 21u IPv4 32754237 TCP localhost:9980->localhost:47530 (CLOSE_WAIT)

Also, if the client has a timeout specified, it will time out instead of hanging. I can also manually run

call close(21)

in the server from gdb, and the client will then disconnect. This happens maybe once in 50,000 requests, but might not happen for extended periods.

Linux version: 2.6.21.7-2.fc8xen
CentOS version: 5.4 (Final)

The socket operations are as follows.

Server:

int client_socket;
struct sockaddr_in client_addr;
socklen_t client_len = sizeof(client_addr);

while(true) {
  client_socket = accept(incoming_socket, (struct sockaddr *)&client_addr, &client_len);
  if (client_socket == -1)
    continue;
  /*  insert into queue here for threads to process  */
}

Then the thread picks up the socket and builds the response.

/*  get client_socket from queue  */

/*  processing request here  */

/*  now set to blocking for write; was previously set to non-blocking for reading  */
int flags = fcntl(client_socket, F_GETFL);
if (flags < 0)
  abort();
if (fcntl(client_socket, F_SETFL, flags|O_NONBLOCK) < 0)
  abort();

server_write(client_socket, response_buf, response_length);
server_close(client_socket);
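
Note that the comment in the snippet above says the socket is switched to blocking mode, while the call as posted ORs in O_NONBLOCK, which selects non-blocking mode. Whether this contributed to the hang is not established here. A minimal sketch of clearing the flag instead follows; set_blocking is a hypothetical helper name that does not appear in the original code.

```c
#include <fcntl.h>

/* Hypothetical helper: clear O_NONBLOCK so that write() blocks.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
int set_blocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    /* Mask the bit out; OR-ing it in would do the opposite. */
    return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) < 0 ? -1 : 0;
}
```

With the socket left non-blocking, write() can fail with EAGAIN when the send buffer is full, and server_write would then return without sending the rest of the response.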

The server_write and server_close functions:

void server_write( int fd, char const *buf, ssize_t len ) {
    printf("Writing to %d\n", fd);
    while(len > 0) {
      ssize_t n = write(fd, buf, len);
      if(n <= 0)
        return; // I don't really care what error happened, we'll just drop the connection
      len -= n;
      buf += n;
    }
}

void server_close( int fd ) {
    for(uint32_t i=0; i<10; i++) {
      int n = close(fd);
      if(!n) { // closed successfully
        return;
      }
      usleep(100);
    }
    printf("Close failed for %d\n", fd);
}
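
Two details of the functions above are worth a second look. server_write returns silently on any error, so a failed write goes unreported. server_close retries close(), which the Linux close(2) man page advises against in multi-threaded programs, because the descriptor may already have been released and reused by another thread. The following is only a sketch of a more defensive variant; write_all and close_once are hypothetical names, not part of the original server.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, retrying only when interrupted by a signal.
 * Returns 0 on success, -1 on error so the caller can log the failure. */
int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && errno == EINTR)
            continue;            /* interrupted before any data was written */
        if (n <= 0)
            return -1;           /* includes EAGAIN on a non-blocking fd */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Close exactly once and report a failure instead of retrying. */
int close_once(int fd) {
    if (close(fd) < 0) {
        perror("close");
        return -1;
    }
    return 0;
}
```

A caller would check both return values and log the descriptor number on failure, which gives the same debugging trail as the printf calls above without the retry loop.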

Client:

The client side uses libcurl v7.27.0:

CURL *curl = curl_easy_init();
CURLcode res;
curl_easy_setopt( curl, CURLOPT_URL, url);
curl_easy_setopt( curl, CURLOPT_WRITEFUNCTION, write_callback );
curl_easy_setopt( curl, CURLOPT_WRITEDATA, write_tag );

res = curl_easy_perform(curl);

Nothing fancy, just a basic curl connection. The client hangs in transfer.c (in libcurl) because the socket is not perceived as being closed. It's waiting for more data from the server.

Things I've tried so far:

Shutdown before close

shutdown(fd, SHUT_WR);
char buf[64];
while(read(fd, buf, 64) > 0);
/*  then close  */

Setting SO_LINGER to close forcibly in 1 second

struct linger l;
l.l_onoff = 1;
l.l_linger = 1;
if (setsockopt(client_socket, SOL_SOCKET, SO_LINGER, &l, sizeof(l)) == -1)
  abort();

These have made no difference. Any ideas would be greatly appreciated.

EDIT -- This ended up being a thread-safety issue inside a queue library causing the socket to be handled inappropriately by multiple threads.
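
The queue library in question is not shown, so the following is only a sketch of one common way to hand descriptors from a listener thread to worker threads safely: a fixed-size ring buffer guarded by a pthread mutex and two condition variables. All names here (fd_queue, fd_queue_init, fd_queue_push, fd_queue_pop, FDQ_CAP) are hypothetical.

```c
#include <pthread.h>

#define FDQ_CAP 64   /* assumed capacity, chosen arbitrarily */

typedef struct {
    int items[FDQ_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} fd_queue;

void fd_queue_init(fd_queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Called by the listener thread; blocks while the queue is full. */
void fd_queue_push(fd_queue *q, int fd) {
    pthread_mutex_lock(&q->lock);
    while (q->count == FDQ_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = fd;
    q->tail = (q->tail + 1) % FDQ_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Called by worker threads; each queued fd is returned to exactly one caller. */
int fd_queue_pop(fd_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int fd = q->items[q->head];
    q->head = (q->head + 1) % FDQ_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return fd;
}
```

Because every read and update of head, tail, and count happens under the same lock, two workers cannot pop the same descriptor, which is the property an unsynchronized queue does not guarantee.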

Recommended answer

Here is some code I've used on many Unix-like systems (e.g. SunOS 4, SGI IRIX, HPUX 10.20, CentOS 5, Cygwin) to close a socket:

int getSO_ERROR(int fd) {
   int err = 1;
   socklen_t len = sizeof err;
   if (-1 == getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len))
      FatalError("getSO_ERROR");
   if (err)
      errno = err;              // set errno to the socket SO_ERROR
   return err;
}

void closeSocket(int fd) {      // *not* the Windows closesocket()
   if (fd >= 0) {
      getSO_ERROR(fd); // first clear any errors, which can cause close to fail
      if (shutdown(fd, SHUT_RDWR) < 0) // secondly, terminate the 'reliable' delivery
         if (errno != ENOTCONN && errno != EINVAL) // SGI causes EINVAL
            Perror("shutdown");
      if (close(fd) < 0) // finally call close()
         Perror("close");
   }
}

But the above does not guarantee that any buffered writes are sent.

Graceful close: It took me about 10 years to figure out how to close a socket. But for another 10 years I just lazily called usleep(20000) for a slight delay to 'ensure' that the write buffer was flushed before the close. This obviously is not very clever, because:


  • The delay was too long most of the time.
  • The delay was too short some of the time--maybe!
  • A signal such as SIGCHLD could occur to end usleep() (but I usually called usleep() twice to handle this case--a hack).
  • There was no indication whether this works. But this is perhaps not important if a) hard resets are perfectly ok, and/or b) you have control over both sides of the link.
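
The third point, a signal cutting the delay short, can be handled without calling the sleep twice: nanosleep() reports the time that was not slept, so the call can be resumed. This sketch only makes the delay itself reliable and does nothing about the other objections in the list; sleep_fully is a hypothetical name.

```c
#include <errno.h>
#include <time.h>

/* Sleep for the full duration even if signals interrupt the call.
 * Returns 0 on success, -1 on an error other than EINTR. */
int sleep_fully(long microseconds) {
    struct timespec req, rem;
    req.tv_sec = microseconds / 1000000L;
    req.tv_nsec = (microseconds % 1000000L) * 1000L;
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;   /* continue with the time still left */
    }
    return 0;
}
```

For example, sleep_fully(20000) would stand in for the usleep(20000) call mentioned above.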

But doing a proper flush is surprisingly hard. Using SO_LINGER is apparently not the way to go.

And SIOCOUTQ appears to be Linux-specific.

Note that shutdown(fd, SHUT_WR) doesn't stop writing, contrary to its name, and maybe contrary to man 2 shutdown.

The function flushSocketBeforeClose() below waits until a read returns zero bytes, or until the timer expires. The function haveInput() is a simple wrapper for select(2), and is set to block for up to 1/100th of a second.

bool haveInput(int fd, double timeout) {
   int status;
   fd_set fds;
   struct timeval tv;
   FD_ZERO(&fds);
   FD_SET(fd, &fds);
   tv.tv_sec  = (long)timeout; // cast needed for C++
   tv.tv_usec = (long)((timeout - tv.tv_sec) * 1000000); // 'suseconds_t'

   while (1) {
      if (!(status = select(fd + 1, &fds, 0, 0, &tv)))
         return FALSE;
      else if (status > 0 && FD_ISSET(fd, &fds))
         return TRUE;
      else if (status > 0)
         FatalError("I am confused");
      else if (errno != EINTR)
         FatalError("select"); // tbd EBADF: man page "an error has occurred"
   }
}

bool flushSocketBeforeClose(int fd, double timeout) {
   const double start = getWallTimeEpoch();
   char discard[99];
   ASSERT(SHUT_WR == 1);
   if (shutdown(fd, 1) != -1)
      while (getWallTimeEpoch() < start + timeout)
         while (haveInput(fd, 0.01)) // can block for 0.01 secs
            if (!read(fd, discard, sizeof discard))
               return TRUE; // success!
   return FALSE;
}

Example of use:

   if (!flushSocketBeforeClose(fd, 2.0)) // can block for 2s
       printf("Warning: Cannot gracefully close socket\n");
   closeSocket(fd);

In the above, my getWallTimeEpoch() is similar to time(), and Perror() is a wrapper for perror().
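
Neither helper is shown in the answer. The following is a guess at their shape, based only on the description above: getWallTimeEpoch() is assumed to return seconds since the epoch as a double with sub-second resolution (the flush loop compares against fractional timeouts), and Perror() is assumed to simply forward to perror().

```c
#include <stdio.h>
#include <sys/time.h>

/* Assumed shape: seconds since the Unix epoch, with microsecond resolution. */
double getWallTimeEpoch(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Assumed shape: report the failing call through perror(). */
void Perror(const char *what) {
    perror(what);
}
```

The author's real versions may differ, for instance by logging elsewhere or by exiting on error.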

EDIT: Some comments:


  • My first admission is a bit embarrassing. The OP and Nemo challenged the need to clear the internal so_error before close, but I cannot now find any reference for this. The system in question was HPUX 10.20. After a failed connect(), just calling close() did not release the file descriptor, because the system wished to deliver an outstanding error to me. But I, like most people, never bothered to check the return value of close(). So I eventually ran out of file descriptors (ulimit -n), which finally got my attention.

(Very minor point) One commentator objected to the hard-coded numerical arguments to shutdown(), rather than e.g. SHUT_WR for 1. The simplest answer is that Windows uses different #defines/enums, e.g. SD_SEND. And many other writers (e.g. Beej) use constants, as do many legacy systems.

Also, I always, always, set FD_CLOEXEC on all my sockets, since in my applications I never want them passed to a child and, more importantly, I don't want a hung child to impact me.

Sample code to set CLOEXEC:

   static void setFD_CLOEXEC(int fd) {
      int status = fcntl(fd, F_GETFD, 0);
      if (status >= 0)
         status = fcntl(fd, F_SETFD, status | FD_CLOEXEC);
      if (status < 0)
         Perror("Error getting/setting socket FD_CLOEXEC flags");
   }
