Web Crawling (Ajax/JavaScript enabled pages) using Java


Problem Description



I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I am unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to crawl the following information from that site (please take a look at the attached screenshot).

If you observe the attached screenshot, it has three names (highlighted in red boxes). If you click one of those links, you will see a popup that contains the full information about that author. I want to crawl the information shown in that popup.

I am using the following code to crawl the content.

// Imports for crawler4j (3.x-era API) and Apache HttpComponents.
import org.apache.http.HttpStatus;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.parser.ParseData;
import edu.uci.ics.crawler4j.parser.Parser;
import edu.uci.ics.crawler4j.url.WebURL;

public class WebContentDownloader {

    private Parser parser;
    private PageFetcher pageFetcher;

    public WebContentDownloader() {
        CrawlConfig config = new CrawlConfig();
        parser = new Parser(config);
        pageFetcher = new PageFetcher(config);
    }

    // Fetches the page at the given URL and parses it; returns null if anything fails.
    private Page download(String url) {
        WebURL curURL = new WebURL();
        curURL.setURL(url);
        PageFetchResult fetchResult = null;
        try {
            fetchResult = pageFetcher.fetchHeader(curURL);
            if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
                try {
                    Page page = new Page(curURL);
                    fetchResult.fetchContent(page);
                    if (parser.parse(page, curURL.getURL())) {
                        return page;
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        } finally {
            if (fetchResult != null) {
                fetchResult.discardContentIfNotConsumed();
            }
        }
        return null;
    }

    // Downloads the page and returns its raw HTML, or null if it could not be fetched or parsed.
    private String processUrl(String url) {
        System.out.println("Processing: " + url);
        Page page = download(url);
        if (page != null) {
            ParseData parseData = page.getParseData();
            if (parseData != null) {
                if (parseData instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) parseData;
                    return htmlParseData.getHtml();
                }
            } else {
                System.out.println("Couldn't parse the content of the page.");
            }
        } else {
            System.out.println("Couldn't fetch the content of the page.");
        }
        return null;
    }

    public String getHtmlContent(String argUrl) {
        return this.processUrl(argUrl);
    }
}
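
This is roughly how the class above can be invoked (the Main class below is only an illustrative sketch, not part of my crawler; the URL is the article from the question):

public class Main {

    public static void main(String[] args) {
        WebContentDownloader downloader = new WebContentDownloader();
        // The static HTML comes back fine; the dynamically loaded author popups are not in it.
        String html = downloader.getHtmlContent(
                "http://www.sciencedirect.com/science/article/pii/S1568494612005741");
        System.out.println(html);
    }
}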

I was able to crawl the content from the aforementioned link/site, but it does not include the information I marked in the red boxes. I think those links load their content dynamically.

  • My question is: how can I crawl the content from the aforementioned link/website?
  • How can I crawl content from Ajax/JavaScript-based websites?

Can anyone please help me with this?

Thanks & Regards, Amar

Solution

Hi, I found a workaround using another library. I used the Selenium WebDriver (org.openqa.selenium.WebDriver) library to extract the dynamic content. Here is the sample code.

import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CollectUrls {

    private WebDriver driver;

    public CollectUrls() {
        this.driver = new FirefoxDriver();
        // Implicit wait: element lookups poll for up to 30 seconds before giving up.
        this.driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
    }

    protected void next(String url, List<String> argUrlsList) {
        this.driver.get(url);
        // The browser has executed the page's JavaScript, so this source
        // includes the dynamically generated content.
        String htmlContent = this.driver.getPageSource();
    }
}

Here the "htmlContent" is the required one. Please let me know if you face any issues...???

Thanks, Amar
