Java HttpClient seems to be caching content


Problem description

I'm building a simple web scraper and I need to fetch the same page a few hundred times. The page contains a dynamic attribute that should change on every request. I've built a multithreaded, HttpClient-based class to process the requests, and I'm using an ExecutorService to create a thread pool and run the threads. The problem is that the dynamic attribute sometimes doesn't change between requests, and I end up getting the same value on 3 or 4 consecutive threads. I've read a lot about HttpClient and I really can't find where this problem comes from. Could it be caused by caching, or something similar?

Update: here is the code executed in each thread:

HttpContext localContext = new BasicHttpContext();

HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params,
        HTTP.DEFAULT_CONTENT_CHARSET);
HttpProtocolParams.setUseExpectContinue(params, true);

ClientConnectionManager connman = new ThreadSafeClientConnManager();

DefaultHttpClient httpclient = new DefaultHttpClient(connman, params);

HttpHost proxy = new HttpHost(inc_proxy, Integer.valueOf(inc_port));
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
        proxy);

HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

String iden = null;
int timeoutConnection = 10000;
HttpConnectionParams.setConnectionTimeout(httpGet.getParams(),
        timeoutConnection);

try {

    HttpResponse response = httpclient.execute(httpGet, localContext);

    HttpEntity entity = response.getEntity();

    if (entity != null) {

        InputStream instream = entity.getContent();
        String result = convertStreamToString(instream);
        // System.out.printf("Resultado\n %s",result +"\n");
        instream.close();

        iden = StringUtils
                .substringBetween(result,
                        "<input name=\"iden\" value=\"",
                        "\" type=\"hidden\"/>");
        System.out.printf("IDEN:%s\n", iden);
        EntityUtils.consume(entity);
    }

}

catch (ClientProtocolException e) {
    // TODO Auto-generated catch block
    System.out.println("Excepção CP");

} catch (IOException e) {
    // TODO Auto-generated catch block
    System.out.println("Excepção IO");
}

Solution

HttpClient does not cache responses by default (that is, when you use the DefaultHttpClient class on its own). It caches only if you wrap it in CachingHttpClient, a decorator for the HttpClient interface that enables caching:

HttpClient client = new CachingHttpClient(new DefaultHttpClient(), cacheConfiguration);

With caching enabled, it uses the If-Modified-Since and If-None-Match headers to decide whether to send the request to the remote server or to return the result from its cache.

I suspect that your issue is caused by a proxy server sitting between your application and the remote server.
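One way to look for evidence of an intermediary cache is to inspect the response headers. The sketch below is only a heuristic, and its names (CacheHints, looksCached) are made up for illustration. It assumes the headers have already been collected into a map. It treats the standard Age header as a sign that a cache served the response, and it also checks the non-standard X-Cache header, which many proxies set to a value containing "HIT". A proxy may send neither, so a negative result proves nothing.

```java
import java.util.Locale;
import java.util.Map;

/** Heuristic check for signs that a response came from an intermediary cache. */
final class CacheHints {

    private CacheHints() {}

    /**
     * Returns true if the given response headers suggest a cached response.
     * Header names are compared case-insensitively, as HTTP requires.
     */
    static boolean looksCached(Map<String, String> headers) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            if (entry.getKey() == null) {
                continue; // some APIs store the status line under a null key
            }
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            String value = entry.getValue() == null
                    ? ""
                    : entry.getValue().toLowerCase(Locale.ROOT);

            // "Age" is normally added by a cache, not by the origin server.
            if (name.equals("age")) {
                return true;
            }
            // "X-Cache" is a non-standard convention (assumption: "HIT" means a cache hit).
            if (name.equals("x-cache") && value.contains("hit")) {
                return true;
            }
        }
        return false;
    }
}
```

With the code from the question, the map could be filled from response.getAllHeaders() before the check is called.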

You can test this easily with curl. First, run a number of requests without the proxy:

#!/bin/bash

for i in {1..50}
do
  echo "*** Performing request number $i"
  curl -D - http://yourserveraddress.com -o $i -s
done

Then run diff across all the downloaded files. Every file should show the difference you mentioned. Next, add the -x/--proxy <host[:port]> option to the curl command, run the script again, and compare the files once more. If some responses are now identical to others, you can be sure the proxy server is the cause.
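If you would rather do the comparison in Java than with diff, a minimal sketch could look like the following. The class and method names (DuplicateResponses, findDuplicates) are hypothetical, and it assumes the response bodies have already been read into a map keyed by request number. It groups the requests by identical body and returns only the groups that contain more than one request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Finds responses whose bodies are byte-for-byte identical as strings. */
final class DuplicateResponses {

    private DuplicateResponses() {}

    /**
     * Groups response names by identical body.
     * Returns only the groups with more than one member, in first-seen order.
     */
    static List<List<String>> findDuplicates(Map<String, String> bodiesByName) {
        // LinkedHashMap keeps the order in which distinct bodies were first seen.
        Map<String, List<String>> namesByBody = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : bodiesByName.entrySet()) {
            namesByBody
                    .computeIfAbsent(entry.getValue(), body -> new ArrayList<>())
                    .add(entry.getKey());
        }

        List<List<String>> duplicates = new ArrayList<>();
        for (List<String> names : namesByBody.values()) {
            if (names.size() > 1) {
                duplicates.add(names);
            }
        }
        return duplicates;
    }
}
```

An empty result without the proxy and a non-empty result with it would point to the proxy, in the same way as the diff comparison.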
