How to parse a webpage that includes JavaScript?

Question

I have a webpage that creates a table using JavaScript. Right now I am using Jsoup in my Java project to parse the webpage. Jsoup cannot run JavaScript, so the table is never generated and the page source it sees is incomplete.

How can I include the HTML created by that script so that I can parse it with Jsoup? Can you provide a simple example? Thank you!

Example webpage:

<!doctype html>
<html>
  <head>
    <title>A blank HTML5 page</title>
    <meta charset="utf-8" />
  </head>
  <body>
    <script>
        var table = document.createElement("table");
        var tr = document.createElement("tr");
        table.appendChild(tr);
        document.body.appendChild(table);
    </script>
    <p>First paragraph</p>
  </body>
</html>

The output should be:

<!DOCTYPE html>
<html>
    <head>
        <title>
            A blank HTML5 page
        </title>
        <meta charset="utf-8"></meta>
    </head>
    <body>
        <script>
            var table = document.createElement("table");
            var tr = document.createElement("tr");
            table.appendChild(tr);
            document.body.appendChild(table);   
        </script>
        <table>
            <tr></tr>
        </table>
        <p>
            First paragraph
        </p>
    </body>
</html>

Jsoup leaves out the table tag because it cannot execute JavaScript. How can I achieve this?

Recommended answer

First possibility

You have some options outside Jsoup, namely driving a "real" browser and interacting with it. An excellent choice for this is Selenium WebDriver. Selenium can use different browsers as its back end, and in your case the very lightweight HtmlUnit may already be enough. If the page calls more complicated JavaScript, there is often no choice other than running a full browser. Luckily, PhantomJS exists, and because it runs headless its footprint is not too bad.
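As a rough illustration of the HtmlUnit route, the sketch below loads a page in HtmlUnit, lets its scripts run, and then hands the resulting DOM to Jsoup. It is only one way to wire the two libraries together, not a definitive recipe. The class name RenderedPageSketch, the method name fetchRendered, and the 2-second wait are assumptions made for the example, and the imports assume HtmlUnit 3.x (2.x releases use the com.gargoylesoftware.htmlunit package instead).

```java
import org.htmlunit.WebClient;      // HtmlUnit 3.x; 2.x uses com.gargoylesoftware.htmlunit
import org.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class RenderedPageSketch {

    /** Loads the page in HtmlUnit so its scripts run, then parses the resulting DOM with Jsoup. */
    static Document fetchRendered(String url) throws IOException {
        try (WebClient webClient = new WebClient()) {
            // CSS is not needed for DOM extraction; script errors should not abort the load.
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);

            // Assumption: 2 seconds is enough for any asynchronous scripts on the page.
            webClient.waitForBackgroundJavaScript(2_000);

            // asXml() serializes the DOM as it stands after script execution.
            return Jsoup.parse(page.asXml(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the page URL (http://, https:// or file://) as the first argument.
        Document doc = fetchRendered(args[0]);
        System.out.println(doc.select("table"));
    }
}
```

With Selenium WebDriver the shape would be similar: load the page with the driver, then pass the string returned by getPageSource() to Jsoup.parse.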

Second possibility

Another approach is to grab the JavaScript source with Jsoup and run it in a JavaScript interpreter inside Java. For that you could use Rhino. However, if you go down that path you might as well use HtmlUnit directly, which is probably a bit less bulky.
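To make the trade-off concrete, here is a minimal sketch of the Rhino route. Rhino is only a JavaScript engine and provides no browser DOM, so the script's document object has to be supplied by hand. The DocumentShim class below is a hypothetical stand-in written for this example; it supports only the calls used by the sample page (createElement, body, appendChild) and would need to grow for any real script. Note also that the script runs after Jsoup has parsed the whole page, so the table ends up after the paragraph rather than before it as in a browser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

public class RhinoDomSketch {

    /** Hypothetical stand-in for the browser's document object, backed by a Jsoup Document. */
    public static class DocumentShim {
        private final Document doc;

        public DocumentShim(Document doc) { this.doc = doc; }

        // Called from the script as document.createElement("table").
        public Element createElement(String tagName) { return doc.createElement(tagName); }

        // Rhino exposes this bean getter to the script as document.body.
        public Element getBody() { return doc.body(); }
    }

    /** Parses the HTML with Jsoup, then runs each inline script against the shim. */
    static Document render(String html) {
        Document doc = Jsoup.parse(html);
        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            ScriptableObject.putProperty(scope, "document",
                    Context.javaToJS(new DocumentShim(doc), scope));

            for (Element script : doc.select("script")) {
                // script.data() returns the raw JavaScript inside the <script> element.
                cx.evaluateString(scope, script.data(), "inline-script", 1, null);
            }
        } finally {
            Context.exit();
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><body><script>"
                + "var table = document.createElement('table');"
                + "var tr = document.createElement('tr');"
                + "table.appendChild(tr);"
                + "document.body.appendChild(table);"
                + "</script><p>First paragraph</p></body></html>";
        System.out.println(render(html).body().html());
    }
}
```

The script's appendChild calls work here only because Jsoup's Element happens to have a method with that name. Every other DOM feature would have to be shimmed the same way, which is why HtmlUnit, with its built-in DOM, is usually the less painful choice.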
