Web Crawling (Ajax/JavaScript enabled pages) using Java
Problem description
I am very new to web crawling. I am using crawler4j to crawl websites and collect the information I need. My problem is that I am unable to crawl the content of the following page: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to extract the following information from it (please take a look at the attached screenshot).
If you look at the attached screenshot, it has three names (highlighted in red boxes). If you click one of these links, a popup appears that contains the full information about that author. I want to crawl the information shown in that popup.
I am using the following code to crawl the content.
import org.apache.http.HttpStatus;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.parser.ParseData;
import edu.uci.ics.crawler4j.parser.Parser;
import edu.uci.ics.crawler4j.url.WebURL;

public class WebContentDownloader {

    private Parser parser;
    private PageFetcher pageFetcher;

    public WebContentDownloader() {
        CrawlConfig config = new CrawlConfig();
        parser = new Parser(config);
        pageFetcher = new PageFetcher(config);
    }

    // Fetches and parses a single URL; returns null if the fetch or parse fails.
    private Page download(String url) {
        WebURL curURL = new WebURL();
        curURL.setURL(url);
        PageFetchResult fetchResult = null;
        try {
            fetchResult = pageFetcher.fetchHeader(curURL);
            if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
                try {
                    Page page = new Page(curURL);
                    fetchResult.fetchContent(page);
                    if (parser.parse(page, curURL.getURL())) {
                        return page;
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        } finally {
            // Always release the underlying connection.
            if (fetchResult != null) {
                fetchResult.discardContentIfNotConsumed();
            }
        }
        return null;
    }

    // Returns the raw HTML of the page, or null if it is unavailable or not HTML.
    private String processUrl(String url) {
        System.out.println("Processing: " + url);
        Page page = download(url);
        if (page != null) {
            ParseData parseData = page.getParseData();
            if (parseData != null) {
                if (parseData instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) parseData;
                    return htmlParseData.getHtml();
                }
            } else {
                System.out.println("Couldn't parse the content of the page.");
            }
        } else {
            System.out.println("Couldn't fetch the content of the page.");
        }
        return null;
    }

    public String getHtmlContent(String argUrl) {
        return this.processUrl(argUrl);
    }
}
I was able to crawl the content of the page above, but the result does not include the information I marked in the red boxes. I think those are dynamic links.
- How can I crawl that content from the page above?
- How can I crawl content from Ajax/JavaScript-based websites in general?
Can anyone please help me with this?
Thanks & Regards, Amar
Hi, I found a workaround using another library. I used the Selenium WebDriver library (org.openqa.selenium.WebDriver) to extract the dynamic content. Here is the sample code.
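import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CollectUrls {

    private WebDriver driver;

    public CollectUrls() {
        this.driver = new FirefoxDriver();
        // Let element lookups wait up to 30 seconds for dynamic content.
        this.driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
    }

    protected void next(String url, List<String> argUrlsList) {
        this.driver.get(url);
        // Page source as returned by the browser after loading the page.
        String htmlContent = this.driver.getPageSource();
    }
}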
Here "htmlContent" is the required content: the page source as returned by the browser after it has loaded the page. Please let me know if you face any issues.
Thanks, Amar
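Once "htmlContent" is available, the author names still have to be pulled out of it. The following is only a minimal sketch of that step, not part of the original answer. It assumes, hypothetically, that each author link is an anchor carrying a CSS class named authorName; the real class name on the target page has to be checked in the browser. The class AuthorNameExtractor and its method are made-up names for illustration, and a regular expression is used only to keep the example dependency-free. A real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the visible text of author links from page source.
public class AuthorNameExtractor {

    // ASSUMPTION: author links look like <a ... class="... authorName ...">Name</a>.
    // The class name "authorName" is illustrative, not taken from the real site.
    private static final Pattern AUTHOR_LINK = Pattern.compile(
            "<a\\b[^>]*class=\"[^\"]*\\bauthorName\\b[^\"]*\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractAuthorNames(String html) {
        List<String> names = new ArrayList<>();
        if (html == null) {
            return names;
        }
        Matcher matcher = AUTHOR_LINK.matcher(html);
        while (matcher.find()) {
            // Drop any nested tags inside the link and trim surrounding whitespace.
            String text = matcher.group(1).replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                names.add(text);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by driver.getPageSource().
        String sample = "<ul>"
                + "<li><a class=\"authorName\" href=\"#\">A. Author</a></li>"
                + "<li><a href=\"#\" class=\"link authorName\">B. <b>Writer</b></a></li>"
                + "</ul>";
        System.out.println(extractAuthorNames(sample)); // prints [A. Author, B. Writer]
    }
}
```

In the WebDriver code above, the same method could be called as AuthorNameExtractor.extractAuthorNames(htmlContent) inside next(...). The content of the popup itself would still require clicking the link in the browser first.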