Apache HTTPClient throws java.net.SocketException: Connection reset for many domains


Problem description

I'm creating a (well-behaved) web spider, and I've noticed that some servers cause Apache HttpClient to throw a SocketException -- specifically:

java.net.SocketException: Connection reset

The code that causes this is:

// Execute the request
HttpResponse response; 
try {
    response = httpclient.execute(httpget); //httpclient is of type HttpClient
} catch (NullPointerException e) {
    return; // HttpClient sometimes throws a NullPointerException from deep inside the library
}

For most servers it's just fine. But for others, it immediately throws a SocketException.

Example of site that causes immediate SocketException: http://www.bhphotovideo.com/

Works great (as do most websites): http://www.google.com/

Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use Apache's HTTP Client. (Code like this:)

 HttpURLConnection c = (HttpURLConnection)url.openConnection();  
 BufferedInputStream in = new BufferedInputStream(c.getInputStream());  
 Reader r = new InputStreamReader(in);     

 int i;  
 while ((i = r.read()) != -1) {  
      source.append((char) i);  
 }  

So, why don't I just use this code instead? Well there are some key features in Apache's HTTP Client that I need to use.

Does anyone know what causes some servers to cause this exception?

Research so far:

  • Problem occurs on my local Mac dev machines AND an AWS EC2 Instance, so it's not a local firewall.

  • It seems the error isn't caused by the remote machine, because the exception doesn't say "by peer".

  • This Stack Overflow question seems relevant: "java.net.SocketException: Connection reset". However, the answers don't explain why this would happen only with Apache HTTP Client and not with other approaches.

Bonus question: I'm doing a fair amount of crawling with this system. Is there generally a better Java class for this other than Apache HTTP Client? I've found a number of issues (such as the NullPointerException I have to catch in the code above). It seems that HTTPClient is very picky about server communications -- more picky than I'd like for a crawler that can't just break when a server doesn't behave.

Thanks all!

Solution

Honestly, I don't have a perfect solution, but it works, so that's good enough for me.

As pointed out by oleg below, Bixo has created a crawler that customizes HttpClient to be more forgiving to servers. To "get around" the issue rather than fix it, I just used the SimpleHttpFetcher provided by Bixo here: (link removed - SO thinks I'm a spammer, so you'll have to google it yourself)

SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname","contact@yourcompany.com","ENTER URL"));
try {
    FetchedResult result = fetch.fetch("ENTER URL");
    System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
    e.printStackTrace();
}

The downside of this solution is that Bixo has a lot of dependencies -- so this may not be a good workaround for everyone. However, you can always work through their use of DefaultHttpClient and see how they instantiate it to get it to work. I decided to use the whole class because it handles some helpful things for me, like automatically following redirects (and reporting the final destination URL).

Thanks for the help all.

Edit: TinyBixo

Hi all. So, I loved how Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a vastly simplified Bixo, without all the dependencies. If you're running into the problems above, I would recommend using it (and feel free to make pull requests if you'd like to update it!)

It's available here: https://github.com/juliuss/TinyBixo

Answer

First, to answer your question:

The connection reset was caused by a problem on the server side. Most likely the server failed to parse the request or was unable to process it and dropped the connection as a result without returning a valid response. There is likely something in the HTTP requests generated by HttpClient that causes server side logic to fail, probably due to a server side bug. Just because the error message does not say 'by peer' does not mean the connection reset took place on the client side.
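The point about the missing "by peer" wording can be checked without HttpClient at all. The following is a minimal sketch using only JDK classes; the class name ResetDemo and the method readFromResettingServer are made up for illustration, and the local server stands in for a remote host that aborts the connection. The server side closes the socket with SO_LINGER set to zero, which makes the operating system send a TCP RST, and the client then reports whatever its read call observed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

class ResetDemo {

    /** Connects to a local server that aborts the connection and returns what the client saw. */
    static String readFromResettingServer() throws IOException, InterruptedException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> {
                try {
                    Socket peer = server.accept();
                    // SO_LINGER with a zero timeout makes close() abort the connection
                    // with a TCP RST instead of the normal FIN handshake.
                    peer.setSoLinger(true, 0);
                    peer.close();
                } catch (IOException ignored) {
                    // The demo only cares about what the client observes.
                }
            });
            serverThread.start();

            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                // Wait until the server side has already reset the connection.
                serverThread.join();
                InputStream in = client.getInputStream();
                int b = in.read();
                return "read returned " + b;
            } catch (SocketException e) {
                // The reset was triggered by the remote end, yet it surfaces
                // here as a plain SocketException on the client.
                return e.getMessage();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readFromResettingServer());
    }
}
```

On common JVMs the client should report a message along the lines of "Connection reset", with no "by peer" suffix, even though the reset was plainly initiated by the other end. The exact wording is platform-dependent, so treat the message text as a hint rather than as proof of which side closed the connection.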

A few remarks:

(1) Several popular web crawlers, such as bixo http://openbixo.org/, use HttpClient without major issues, but pretty much all of them had to tweak HttpClient's behavior to make it more lenient about common HTTP protocol violations. By default, HttpClient is rather strict about HTTP protocol compliance.

(2) Why did you not report the NPE problem, or any other problem you have been experiencing, to the HttpClient project?
