Jsoup fetching a partial page

Problem description

I am trying to scrape the contents of bidding websites but am unable to fetch the complete page. I use crowbar on xulrunner to fetch the page first (because ajax loads certain elements lazily) and then scrape from the saved file. On the main page of the bidrivals website, however, this fails even though the local file is well formed: jSoup seems to stop midway through the HTML, ending with '...' characters. If anyone has encountered this before, please help. The following code is called for [this link].

File f = new File(projectLocation+logFile+"bidrivalsHome");
try {
    f.createNewFile();
    log.warn("Trying to fetch mainpage through a console.");
    WinRedirect.redirect(projectLocation+"Curl.exe -s --data \"url="+website+"&delay="+timeDelay+"\" http://127.0.0.1:10000", projectLocation, logFile+"bidrivalsHome");
} catch (Exception e) {
    e.printStackTrace();
    log.warn("Error in fetching the nameList", e);
}
Document doc = new Document("");
try {
    doc = Jsoup.parse(f, "UTF-8", website);
} catch (IOException e1) {
    System.out.println("Error while parsing the document.");
    e1.printStackTrace();
    log.warn("Error in parsing homepage", e1);
}

Recommended answer

Try using HtmlUnit to render the page (including JavaScript and CSS dom manipulation) and then pass the rendered HTML to jsoup.

// load page using HTML Unit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);

// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml(), baseURI);

// clean up resources        
webClient.close();



page.html - source code

<html>
<head>
    <script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
    <div class="container">
        <table id="data" border="1">
            <tr>
                <th>col1</th>
                <th>col2</th>
            </tr>
        </table>
    </div>
</body>
</html>

loadData.js

    // append rows and cols to table#data in page.html
    function loadData() {
        var data = document.getElementById("data");
        for (var row = 0; row < 2; row++) {
            var tr = document.createElement("tr");
            for (var col = 0; col < 2; col++) {
                var td = document.createElement("td");
                td.appendChild(document.createTextNode(row + "." + col));
                tr.appendChild(td);
            }
            data.appendChild(tr);
        }
    }

page.html when loaded into a browser

| Col1 | Col2 |
| ------ | ------ |
| 0.0 | 0.1 |
| 1.0 | 1.1 |

Using jsoup to parse page.html for the column data

    // load source from file
    Document doc = Jsoup.parse(new File("page.html"), "UTF-8");

    // iterate over row and col
    for (Element row : doc.select("table#data > tbody > tr"))
        for (Element col : row.select("td"))
            // print results
            System.out.println(col.ownText());

Output

(empty)

What happened?

Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation. In this example, the rows and cols are never appended to the data table.
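
Because jsoup only sees the text it is given, one quick diagnostic before parsing is to check whether the saved file itself is complete. The following is a minimal sketch using only the Java standard library; `FetchedFileCheck` and `looksComplete` are hypothetical names introduced for illustration and are not part of jsoup, HtmlUnit, or the question's code. It assumes the page is expected to end with a closing `</html>` tag, which is a rough heuristic and not a validity check.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of jsoup or HtmlUnit.
final class FetchedFileCheck {

    private FetchedFileCheck() {
    }

    // Returns true if the text ends with a closing </html> tag,
    // ignoring case and surrounding whitespace.
    static boolean looksComplete(String html) {
        return html.trim().toLowerCase(Locale.ROOT).endsWith("</html>");
    }

    // Reads the whole file as UTF-8 and applies the same check.
    static boolean looksComplete(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return looksComplete(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

If the check fails for the fetched file, the content was probably cut off before jsoup read it (for example, if the file was read before the fetch finished writing). If it passes but elements are still missing, the missing parts are more likely added by scripts in the browser, which is the case described here.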

How do I parse the page as rendered in the browser?

    // load page using HTML Unit and fire scripts
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());

    // convert page to generated HTML and convert to document
    doc = Jsoup.parse(myPage.asXml());

    // iterate row and col
    for (Element row : doc.select("table#data > tbody > tr"))
        for (Element col : row.select("td"))
            // print results
            System.out.println(col.ownText());

    // clean up resources
    webClient.close();

Output

0.0
0.1
1.0
1.1
