close() is not closing socket properly

Problem description

I have a multi-threaded server (thread pool) that is handling a large number of requests (up to 500/sec for one node), using 20 threads. There's a listener thread that accepts incoming connections and queues them for the handler threads to process. Once the response is ready, the threads then write out to the client and close the socket. All seemed to be fine until recently, a test client program started hanging randomly after reading the response. After a lot of digging, it seems that the close() from the server is not actually disconnecting the socket. I've added some debugging prints to the code with the file descriptor number and I get this type of output.

Processing request for 21
Writing to 21
Closing 21

The return value of close() is 0, or there would be another debug statement printed. After this output with a client that hangs, lsof is showing an established connection.

SERVER 8160 root 21u IPv4 32754237 TCP localhost:9980->localhost:47530 (ESTABLISHED)

CLIENT 17747 root 12u IPv4 32754228 TCP localhost:47530->localhost:9980 (ESTABLISHED)

It's as if the server never sends the shutdown sequence to the client. The connection stays in this state until the client is killed, which leaves the server side in CLOSE_WAIT:

SERVER 8160 root 21u IPv4 32754237 TCP localhost:9980->localhost:47530 (CLOSE_WAIT)

Also, if the client has a timeout specified, it will time out instead of hanging. I can also manually run

call close(21)

in the server from gdb, and the client will then disconnect. This happens maybe once in 50,000 requests, but might not happen for extended periods.

Linux version: 2.6.21.7-2.fc8xen
CentOS version: 5.4 (Final)

The socket operations are as follows.

Server:

int client_socket;
struct sockaddr_in client_addr;
socklen_t client_len = sizeof(client_addr);  

while(true) {
  client_socket = accept(incoming_socket, (struct sockaddr *)&client_addr, &client_len);
  if (client_socket == -1)
    continue;
  /*  insert into queue here for threads to process  */
}

Then the thread picks up the socket and builds the response.

/*  get client_socket from queue  */

/*  processing request here  */

/*  now set to blocking for write; was previously set to non-blocking for reading  */
int flags = fcntl(client_socket, F_GETFL);
if (flags < 0)
  abort();
if (fcntl(client_socket, F_SETFL, flags & ~O_NONBLOCK) < 0)  /* clear O_NONBLOCK so write() blocks, as the comment above intends */
  abort();

server_write(client_socket, response_buf, response_length);
server_close(client_socket);
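
For reference, switching a descriptor between blocking and non-blocking mode with fcntl() could look like the sketch below. The helper name set_blocking is hypothetical and is not part of the original server. The point it illustrates is that blocking mode means clearing O_NONBLOCK, while OR-ing the flag in makes the socket non-blocking.

```c
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper (not from the original post).
   Returns 0 on success, -1 on error with errno set by fcntl(). */
int set_blocking(int fd, bool blocking) {
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    if (blocking)
        flags &= ~O_NONBLOCK;   /* clear the flag: write() may block */
    else
        flags |= O_NONBLOCK;    /* set the flag: write() may fail with EAGAIN */
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

With a helper like this, the handler thread would call set_blocking(client_socket, true) before writing the response.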

server_write and server_close.

void server_write( int fd, char const *buf, ssize_t len ) {
    printf("Writing to %d\n", fd);
    while(len > 0) {
      ssize_t n = write(fd, buf, len);
      if(n <= 0)
        return; // I don't really care what error happened, we'll just drop the connection
      len -= n;
      buf += n;
    }
}

void server_close( int fd ) {
    for(uint32_t i=0; i<10; i++) {
      int n = close(fd);
      if(!n) { // closed successfully
        return;
      }
      usleep(100);
    }
    printf("Close failed for %d\n", fd);
}
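
Because server_write returns nothing, the caller cannot tell whether the whole response was written. A minimal sketch of a variant that reports the outcome and retries when a signal interrupts write() might look like this. The name write_all and its return convention are assumptions for illustration, not the poster's code.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not from the original post).
   Writes all len bytes; returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted before anything was written: retry */
        if (n <= 0)
            return -1;          /* real error; errno says which one */
        buf += n;               /* partial write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller could log the failure, or skip any graceful-close logic, when write_all returns -1.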

Client:

Client side is using libcurl v 7.27.0

CURL *curl = curl_easy_init();
CURLcode res;
curl_easy_setopt( curl, CURLOPT_URL, url);
curl_easy_setopt( curl, CURLOPT_WRITEFUNCTION, write_callback );
curl_easy_setopt( curl, CURLOPT_WRITEDATA, write_tag );

res = curl_easy_perform(curl);

Nothing fancy, just a basic curl connection. The client hangs in transfer.c (in libcurl) because the socket is not perceived as being closed. It's waiting for more data from the server.

Things I've tried so far:

Shutdown prior to close

shutdown(fd, SHUT_WR);
char buf[64];
while(read(fd, buf, 64) > 0);
/*  then close  */

Setting SO_LINGER to close forcibly in 1 second

struct linger l;
l.l_onoff = 1;
l.l_linger = 1;
if (setsockopt(client_socket, SOL_SOCKET, SO_LINGER, &l, sizeof(l)) == -1)
  abort();

These have made no difference. Any ideas would be greatly appreciated.

EDIT -- This ended up being a thread-safety issue inside a queue library causing the socket to be handled inappropriately by multiple threads.
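
The post does not show the queue library or the exact bug. As a point of comparison only, a minimal sketch of a bounded descriptor queue guarded by a pthread mutex and condition variables is given below. All names (fd_queue, fdq_push, fdq_pop, FDQ_CAPACITY) are hypothetical. The property that matters is that each queued descriptor is handed to exactly one handler thread.

```c
#include <pthread.h>
#include <stddef.h>

#define FDQ_CAPACITY 1024       /* assumed bound; not from the original post */

typedef struct {
    int fds[FDQ_CAPACITY];
    int head, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
} fd_queue;

void fdq_init(fd_queue *q) {
    q->head = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Called by the listener thread after accept(). Blocks while the queue is full. */
void fdq_push(fd_queue *q, int fd) {
    pthread_mutex_lock(&q->lock);
    while (q->count == FDQ_CAPACITY)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->fds[(q->head + q->count) % FDQ_CAPACITY] = fd;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Called by handler threads. Blocks while the queue is empty.
   Each descriptor is returned to exactly one caller. */
int fdq_pop(fd_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int fd = q->fds[q->head];
    q->head = (q->head + 1) % FDQ_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return fd;
}
```

Every read and write of head and count happens under the lock, and the waits sit in while loops so that spurious wakeups are harmless.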

Recommended answer

Here is some code I've used on many Unix-like systems (e.g. SunOS 4, SGI IRIX, HPUX 10.20, CentOS 5, Cygwin) to close a socket:

int getSO_ERROR(int fd) {
   int err = 1;
   socklen_t len = sizeof err;
   if (-1 == getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len))
      FatalError("getSO_ERROR");
   if (err)
      errno = err;              // set errno to the socket SO_ERROR
   return err;
}

void closeSocket(int fd) {      // *not* the Windows closesocket()
   if (fd >= 0) {
      getSO_ERROR(fd); // first clear any errors, which can cause close to fail
      if (shutdown(fd, SHUT_RDWR) < 0) // secondly, terminate the 'reliable' delivery
         if (errno != ENOTCONN && errno != EINVAL) // SGI causes EINVAL
            Perror("shutdown");
      if (close(fd) < 0) // finally call close()
         Perror("close");
   }
}

But the above does not guarantee that any buffered writes are sent.

Graceful close: It took me about 10 years to figure out how to close a socket. But for another 10 years I just lazily called usleep(20000) for a slight delay to 'ensure' that the write buffer was flushed before the close. This obviously is not very clever, because:

  • The delay was too long most of the time.
  • The delay was too short some of the time--maybe!
  • A signal such as SIGCHLD could occur to end usleep() (but I usually called usleep() twice to handle this case--a hack).
  • There was no indication whether this works. But this is perhaps not important if a) hard resets are perfectly ok, and/or b) you have control over both sides of the link.
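
On the signal problem specifically, calling usleep() twice is not the only option. A sleep that resumes after an interruption can be sketched with nanosleep(), which reports the unslept time. The helper name sleep_fully is hypothetical. This only makes the delay reliable. It does not address the other objections in the list.

```c
#include <errno.h>
#include <time.h>

/* Hypothetical helper (not from the original answer).
   Sleeps for the full duration even if signals arrive.
   Returns 0 on success, -1 on an error other than EINTR. */
int sleep_fully(double seconds) {
    struct timespec req, rem;
    req.tv_sec  = (time_t)seconds;
    req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1e9);
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;              /* continue with the time still remaining */
    }
    return 0;
}
```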

But doing a proper flush is surprisingly hard. Using SO_LINGER is apparently not the way to go; see for example:

And SIOCOUTQ appears to be Linux-specific.

Note shutdown(fd, SHUT_WR) doesn't stop writing, contrary to its name, and maybe contrary to man 2 shutdown.
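
One way to read this note is that data already queued is still delivered after shutdown(fd, SHUT_WR), and the peer sees end-of-file only after that data. The sketch below checks that behaviour on an AF_UNIX socketpair, used here as a local stand-in for a TCP connection (an assumption made to keep the example self-contained). The function name is hypothetical.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo (not from the original answer).
   Returns 1 if the peer saw the queued bytes followed by EOF, 0 otherwise. */
int queued_data_survives_half_close(void) {
    int sv[2];
    char buf[8];
    int ok = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 0;
    if (write(sv[0], "bye", 3) == 3 &&          /* queue some data */
        shutdown(sv[0], SHUT_WR) == 0 &&        /* then half-close */
        read(sv[1], buf, sizeof buf) == 3 &&    /* peer still gets the data */
        read(sv[1], buf, sizeof buf) == 0)      /* and only then sees EOF */
        ok = 1;
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```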

This code flushSocketBeforeClose() waits until a read of zero bytes, or until the timer expires. The function haveInput() is a simple wrapper for select(2), and is set to block for up to 1/100th of a second.

bool haveInput(int fd, double timeout) {
   int status;
   fd_set fds;
   struct timeval tv;
   tv.tv_sec  = (long)timeout; // cast needed for C++
   tv.tv_usec = (long)((timeout - tv.tv_sec) * 1000000); // 'suseconds_t'

   while (1) {
      // select() modifies the set, so rebuild it before every call (e.g. after EINTR)
      FD_ZERO(&fds);
      FD_SET(fd, &fds);
      if (!(status = select(fd + 1, &fds, 0, 0, &tv)))
         return FALSE;
      else if (status > 0 && FD_ISSET(fd, &fds))
         return TRUE;
      else if (status > 0)
         FatalError("I am confused");
      else if (errno != EINTR)
         FatalError("select"); // tbd EBADF: man page "an error has occurred"
   }
}

bool flushSocketBeforeClose(int fd, double timeout) {
   const double start = getWallTimeEpoch();
   char discard[99];
   ASSERT(SHUT_WR == 1);
   if (shutdown(fd, 1) != -1)
      while (getWallTimeEpoch() < start + timeout)
         while (haveInput(fd, 0.01)) // can block for 0.01 secs
            if (!read(fd, discard, sizeof discard))
               return TRUE; // success!
   return FALSE;
}

Example of use:

   if (!flushSocketBeforeClose(fd, 2.0)) // can block for 2s
       printf("Warning: Cannot gracefully close socket\n");
   closeSocket(fd);

In the above, my getWallTimeEpoch() is similar to time(), and Perror() is a wrapper for perror().

Some comments:

  • My first admission is a bit embarrassing. The OP and Nemo challenged the need to clear the internal so_error before close, but I cannot now find any reference for this. The system in question was HPUX 10.20. After a failed connect(), just calling close() did not release the file descriptor, because the system wished to deliver an outstanding error to me. But I, like most people, never bothered to check the return value of close. So I eventually ran out of file descriptors (ulimit -n), which finally got my attention.

  • (very minor point) One commentator objected to the hard-coded numerical arguments to shutdown(), rather than e.g. SHUT_WR for 1. The simplest answer is that Windows uses different #defines/enums e.g. SD_SEND. And many other writers (e.g. Beej) use constants, as do many legacy systems.

  • Also, I always, always, set FD_CLOEXEC on all my sockets, since in my applications I never want them passed to a child and, more importantly, I don't want a hung child to impact me.

Sample code to set CLOEXEC:

   static void setFD_CLOEXEC(int fd) {
      int status = fcntl(fd, F_GETFD, 0);
      if (status >= 0)
         status = fcntl(fd, F_SETFD, status | FD_CLOEXEC);
      if (status < 0)
         Perror("Error getting/setting socket FD_CLOEXEC flags");
   }
