Java parsing HTML elements generated by JS

Question

I'm very new to HTML parsing with Java. I previously used JSoup to parse simple, static HTML, but I now need to parse a web page that has dynamic elements. Below is the code I tried, which is the same approach I used before. It could not find the elements, because they were added after the page had loaded. The page in question uses Google Maps with markers on it, and I'm attempting to scrape the images of these markers.

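    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class ImageScraper {
        public static void main(String[] args) {
            Document doc;
            try {
                // Fetches the static HTML only; no JavaScript is executed
                doc = Jsoup.connect("https://pokevision.com")
                        .userAgent(
                                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36")
                        .get();
            } catch (IOException e) {
                e.printStackTrace();
                return; // nothing to parse if the request failed
            }

            // Select <img> elements whose src contains an image file extension
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
            for (Element image : images) {
                System.out.println("src : " + image.attr("src"));
            }
        }
    }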

Since this is apparently impossible with JSoup alone, what other libraries can I use to find the image sources?

Recommended answer

The problem you are facing is that Jsoup retrieves the static source code, exactly as it would be delivered to a browser, and does not execute scripts. What you want is the DOM after the JavaScript has run. For this, you can use HtmlUnit to get the rendered page and then pass its content to Jsoup for parsing.

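// Imports assume HtmlUnit 2.x; HtmlUnit 3.x and later use the org.htmlunit package instead
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// capture rendered page; try-with-resources closes the client even if an exception is thrown
try (WebClient webClient = new WebClient()) {
    HtmlPage myPage = webClient.getPage("https://pokevision.com");

    // convert to jsoup dom
    Document doc = Jsoup.parse(myPage.asXml());

    // extract data using jsoup selectors
    Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    for (Element image : images) {
        System.out.println("src : " + image.attr("src"));
    }
}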
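
The selector used in both snippets, img[src~=(?i)\.(png|jpe?g|gif)], relies on jsoup's attribute-regex form. To my knowledge jsoup applies the pattern with "find" semantics, so it can match anywhere in the src value rather than only at the end. The following is a minimal standalone sketch of that behavior using only java.util.regex; the class name SrcPatternDemo, the method looksLikeImage, and the sample paths are hypothetical and are not part of jsoup or the original answer.

```java
import java.util.regex.Pattern;

public class SrcPatternDemo {
    // Same regular expression as in the selector; (?i) makes it case-insensitive
    private static final Pattern IMAGE_EXT = Pattern.compile("(?i)\\.(png|jpe?g|gif)");

    // Hypothetical helper: true if the pattern occurs anywhere in the src value
    public static boolean looksLikeImage(String src) {
        return IMAGE_EXT.matcher(src).find();
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only
        String[] samples = {
            "/markers/25.PNG",
            "/img/photo.jpeg?v=2",
            "/js/app.js",
            "/docs/report.gift.html"
        };
        for (String src : samples) {
            System.out.println(src + " -> " + looksLikeImage(src));
        }
    }
}
```

Because the pattern is unanchored, the last sample also counts as a match (it contains ".gif"). If that matters for your page, a stricter pattern such as (?i)\.(png|jpe?g|gif)($|\?) may be worth considering.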
