Apache HTTPClient throws java.net.SocketException: Connection reset for many domains


Problem description

I'm creating a (well-behaved) web spider and I notice that some servers are causing Apache HttpClient to give me a SocketException -- specifically:

java.net.SocketException: Connection reset

The code that causes this is:

// Execute the request
HttpResponse response; 
try {
    response = httpclient.execute(httpget); //httpclient is of type HttpClient
} catch (NullPointerException e) {
    return;//deep down in apache http sometimes throws a null pointer...  
}

For most servers it's just fine. But for others, it immediately throws a SocketException.

Example of site that causes immediate SocketException: http://www.bhphotovideo.com/

Works great (as do most websites): http://www.google.com/

Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use Apache's HTTP Client. (Code like this:)

 HttpURLConnection c = (HttpURLConnection)url.openConnection();  
 BufferedInputStream in = new BufferedInputStream(c.getInputStream());  
 Reader r = new InputStreamReader(in);     

 int i;  
 while ((i = r.read()) != -1) {  
      source.append((char) i);  
 }  

So, why don't I just use this code instead? Well there are some key features in Apache's HTTP Client that I need to use.

Does anyone know what causes some servers to cause this exception?

Research so far:

  • Problem occurs on my local Mac dev machines AND an AWS EC2 Instance, so it's not a local firewall.

  • It seems the error isn't caused by the remote machine because the exception doesn't say "by peer"

  • This Stack Overflow question seems relevant: "java.net.SocketException: Connection reset", but the answers don't show why this would happen only from Apache HTTP Client and not other approaches.

Bonus question: I'm doing a fair amount of crawling with this system. Is there generally a better Java class for this other than Apache HTTP Client? I've found a number of issues (such as the NullPointerException I have to catch in the code above). It seems that HTTPClient is very picky about server communications -- more picky than I'd like for a crawler that can't just break when a server doesn't behave.

Thanks all!

Solution

Honestly, I don't have a perfect solution, but it works, so that's good enough for me.

As pointed out by oleg below, Bixo has created a crawler that customizes HttpClient to be more forgiving to servers. To "get around" the issue more than fix it, I just used SimpleHttpFetcher provided by Bixo here: (link removed - SO thinks I'm a spammer, so you'll have to google it yourself)

SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname","contact@yourcompany.com","ENTER URL"));
try {
    FetchedResult result = fetch.fetch("ENTER URL");
    System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
    e.printStackTrace();
}

The downside to this solution is that there are a lot of dependencies for Bixo -- so this may not be a good workaround for everyone. However, you can always just work through their use of DefaultHttpClient and see how they instantiated it to get it to work. I decided to use the whole class because it handles some helpful things for me, like automatically following redirects (and reporting the final destination URL).

Thanks for the help all.

Edit: TinyBixo

Hi all. So, I loved how Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a vastly simplified Bixo, without all the dependencies. If you're running into the problems above, I would recommend using it (and feel free to make pull requests if you'd like to update it!)

It's available here: https://github.com/juliuss/TinyBixo

Answer

First, to answer your question:

The connection reset was caused by a problem on the server side. Most likely the server failed to parse the request or was unable to process it and dropped the connection as a result without returning a valid response. There is likely something in the HTTP requests generated by HttpClient that causes server side logic to fail, probably due to a server side bug. Just because the error message does not say 'by peer' does not mean the connection reset took place on the client side.
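
To illustrate that last point, here is a minimal sketch using only the JDK. It is a demo under stated assumptions, not a reproduction of what the sites in the question do: a throwaway local server aborts the connection, and the client then tries to read. The names ResetDemo and readFromResettingServer are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ResetDemo {

    /**
     * Connects to a one-shot local server that aborts the connection,
     * then returns the exception the client gets on read (null if none).
     */
    public static IOException readFromResettingServer() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> {
                try (Socket accepted = server.accept()) {
                    // SO_LINGER with a zero timeout makes close() abort the
                    // connection (TCP RST) instead of closing it gracefully.
                    accepted.setSoLinger(true, 0);
                } catch (IOException ignored) {
                    // Nothing useful to do in this demo.
                }
            });
            serverThread.start();

            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                serverThread.join();            // wait until the server side has closed
                client.getInputStream().read(); // read on the aborted connection
                return null;
            } catch (IOException e) {
                return e;                       // what the client-side code observes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // demo setup failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        IOException seen = readFromResettingServer();
        System.out.println(seen == null ? "no exception" : seen.toString());
    }
}
```

The reset here is issued entirely by the remote end, yet the client simply gets a SocketException on read; on the JVMs I am aware of the message is typically the plain "Connection reset" from the question, without "by peer".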

A few remarks:

(1) Several popular web crawlers, such as bixo (http://openbixo.org/), use HttpClient without major issues, but pretty much all of them had to tweak HttpClient's behavior to make it more lenient about common HTTP protocol violations. By default, HttpClient is rather strict about HTTP protocol compliance.

(2) Why did you not report the NPE problem, or any other problem you have been experiencing, to the HttpClient project?
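
When narrowing down which part of a request a server chokes on, or when putting together a report for the HttpClient project, it can help to capture the exact request each client sends and compare them. The following is a rough sketch using only the JDK: a one-shot local server records the request line and headers it receives from HttpURLConnection. The names RequestDump and captureRequestHead are hypothetical, and pointing HttpClient at the same kind of local port to get its side of the comparison is left out here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RequestDump {

    /** Sends one GET via HttpURLConnection to a local server and returns the request head it received. */
    public static String captureRequestHead() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            StringBuilder captured = new StringBuilder();
            Thread serverThread = new Thread(() -> {
                try (Socket s = server.accept()) {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(s.getInputStream(), StandardCharsets.ISO_8859_1));
                    String line;
                    // Record the request line and headers up to the blank line.
                    while ((line = in.readLine()) != null && !line.isEmpty()) {
                        captured.append(line).append('\n');
                    }
                    // Reply with an empty 200 so the client completes normally.
                    OutputStream out = s.getOutputStream();
                    out.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
                            .getBytes(StandardCharsets.ISO_8859_1));
                    out.flush();
                } catch (IOException e) {
                    captured.append("server error: ").append(e).append('\n');
                }
            });
            serverThread.start();

            URL url = URI.create("http://127.0.0.1:" + server.getLocalPort() + "/").toURL();
            // NO_PROXY keeps any system proxy settings out of the picture.
            HttpURLConnection c = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            c.getResponseCode(); // triggers the actual request
            c.disconnect();

            serverThread.join(); // makes the captured text safely visible here
            return captured.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(captureRequestHead());
    }
}
```

Differences in headers such as User-Agent, Accept or Connection between the two clients are reasonable first suspects, though this sketch cannot say which one, if any, a given server objects to.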
