Jsoup not downloading entire page


Question

The web page is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm

I want to extract all the <tr class="tr_normal"> elements using Jsoup.

The code I am using is:

Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());

But the reported size (1350) is smaller than the actual number of elements (1452). I copied the page to my computer, deleted some <tr> elements, and ran the same code against it, and the count was correct. It looks like there are too many elements, so Jsoup can't read all of them?

What is going on here? Thanks!

Recommended answer

The problem is in Jsoup's internal HTTP connection handling; there is nothing wrong with the selector engine. My first suggestion, before digging deeper, was to replace Jsoup's connection with HttpClient (http://hc.apache.org/) or, if you can't add HttpClient as a dependency, to check how the Jsoup source code handles the HTTP connection.

Update: the actual cause is the default maxBodySize of Jsoup's Connection. The response body is cut off once that limit is reached, so the rows near the end of the page are never parsed. Raising the limit, or setting it to 0 for unlimited size, returns the full page. I have kept the HttpClient code as a sample.

Output of the program:

  • load from file= 1452
  • load from http client= 1452
  • load from jsoup connect= 1350
  • load from jsoup connect using maxBodySize= 1452

package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

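    /**
     * Counts the tr_normal rows four ways: from a saved copy of the page,
     * via HttpClient, via Jsoup.connect with the default body limit, and
     * via Jsoup.connect with a raised maxBodySize.
     */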
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());

        doc = Jsoup.parse(loadContentByHttpClient(), "UTF8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());

        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());

        // About 2 MB. The default noted in this answer is 1 MB; 0 means unlimited size.
        int maxBodySize = 2048000;
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    // Loads a saved copy of the page; assumes eisdeqty_pf.htm is on the classpath.
    public static InputStream loadContentFromClasspath() {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }

}
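
To see why a byte limit leads to a smaller element count rather than an error, here is a minimal sketch that uses only the Java standard library. It is not Jsoup's implementation; it only imitates the effect of a body size cap on a generated page. The names BodyCapDemo, readCapped, countRows and buildPage are hypothetical, and the page is synthetic, not the HKEX page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BodyCapDemo {

    // Reads at most maxBytes from the stream. A value of 0 means "no limit",
    // following the convention described for maxBodySize in the answer above.
    static byte[] readCapped(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        try {
            while (true) {
                int want = (maxBytes == 0) ? buf.length : Math.min(buf.length, remaining);
                if (want == 0) {
                    break; // cap reached: the rest of the body is silently dropped
                }
                int n = in.read(buf, 0, want);
                if (n == -1) {
                    break; // end of stream
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Counts rows by looking for the class attribute in the raw markup.
    static int countRows(String html) {
        String marker = "class=\"tr_normal\"";
        int count = 0;
        int i = html.indexOf(marker);
        while (i != -1) {
            count++;
            i = html.indexOf(marker, i + marker.length());
        }
        return count;
    }

    // Builds a synthetic table with the given number of tr_normal rows.
    static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        byte[] page = buildPage(1000).getBytes(StandardCharsets.UTF_8);

        // Cap at half the page size: only the rows in the first half survive.
        String capped = new String(
                readCapped(new ByteArrayInputStream(page), page.length / 2),
                StandardCharsets.UTF_8);
        // Cap of 0: the whole page is read.
        String full = new String(
                readCapped(new ByteArrayInputStream(page), 0),
                StandardCharsets.UTF_8);

        System.out.println("rows with cap= " + countRows(capped));
        System.out.println("rows without cap= " + countRows(full));
    }
}
```

The capped read still yields markup that can be searched or parsed, just with the trailing rows missing. That matches the symptom in the question: a plausible but too-small count and no exception.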
