How to get dynamic contents of any web page in DOM tree using JSOUP in Java


Question


In my project, I parse HTML pages and then use the DOM tree for various operations, such as comparing the templates of two URLs.

For that, I am using JSOUP.

But it is not able to load dynamic content into the DOM tree.

Can you tell me how I can load dynamic content using JSOUP in Java, or suggest another way of doing the same thing?

Edit No. 1

As the given link shows, it works using PhantomJS and Zombie.js in Java. Can you tell me how I can do this?

Edit No. 2

I first tried to get the dynamic page using Selenium, with the following code:

public static void main(String[] args) throws IOException {

    // Selenium
    WebDriver driver = new FirefoxDriver();
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.get("ANOTHER URL HERE");
    String html_content1 = driver.getPageSource();
    driver.close();

    // Jsoup makes DOM here by parsing HTML content
    Document doc1 = Jsoup.parse(html_content);
    Document doc2 = Jsoup.parse(html_content1);

    // OPERATIONS USING DOM TREE
}

But this takes a lot of time, even after optimizing. Now, as per your instructions, I have moved to HtmlUnit. However, I am not able to write code that gets the dynamic page source into a String, which I would then parse further with Jsoup. Please help me write that code using HtmlUnit.

Code using HtmlUnit:

package XXX.YYY.ZZZ.Template_Matching;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Assert;
import org.junit.Test;

/**
 *
 * @author jhamb
 */
public class HtmlUnit {

    @Test
    public void homePage() throws Exception {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://www.jabong.com/Yepme-3-4Th-Sleeve-Printed-Blue-Top-Mksp-191481.html");

        Document ht = page.getOwnerDocument();
        System.out.println(ht);

        webClient.closeAllWindows();
    }

    public static void main(String[] args) throws Exception {
        HtmlUnit htmlUnit = new HtmlUnit();
        htmlUnit.homePage();
    }
}

Solution

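I'm afraid Jsoup won't work in this case. It only parses the HTML it is given and does not execute JavaScript, so content added by scripts never reaches its DOM tree.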

Try using HtmlUnit.
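
The following is a minimal sketch of how that could look, not a definitive implementation. It lets HtmlUnit load the page and run its scripts, serializes the resulting DOM with `asXml()`, and hands that String to Jsoup. The class name `RenderedPageFetcher`, the method `fetchRenderedDom`, and the wait time are hypothetical names and values chosen for illustration. It assumes an HtmlUnit 2.x release recent enough for `WebClient` to be used in try-with-resources (the `com.gargoylesoftware.htmlunit` package, as in the question) and Jsoup on the classpath.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

// Hypothetical helper class, not part of HtmlUnit or Jsoup.
public class RenderedPageFetcher {

    /**
     * Loads the URL with HtmlUnit, gives background JavaScript up to
     * jsWaitMillis to finish, and returns the resulting markup as a Jsoup Document.
     */
    public static Document fetchRenderedDom(String url, long jsWaitMillis) throws IOException {
        // try-with-resources closes the client and all of its windows.
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            // CSS is not needed for structural comparison; skipping it saves time.
            webClient.getOptions().setCssEnabled(false);
            // Real-world pages often contain script errors; do not abort on them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);

            // Block until pending background scripts finish or the timeout elapses.
            webClient.waitForBackgroundJavaScript(jsWaitMillis);

            // asXml() serializes the current DOM, including script-generated nodes.
            String renderedHtml = page.asXml();

            // Passing the URL as base URI lets Jsoup resolve relative links.
            return Jsoup.parse(renderedHtml, url);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: RenderedPageFetcher <url>");
            return;
        }
        // 5000 ms is an arbitrary wait chosen for this sketch.
        Document doc = fetchRenderedDom(args[0], 5000L);
        System.out.println("Title: " + doc.title());
        System.out.println("Elements in DOM: " + doc.getAllElements().size());
    }
}
```

To compare two pages, call `fetchRenderedDom` once per URL and run the existing DOM operations on the two returned `Document` objects. HtmlUnit's JavaScript support is not identical to a real browser's, so script-heavy pages may still render differently than they do in Selenium.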
