Crawl a dynamic web page using HtmlUnit


Question

I am crawling data with HtmlUnit from a dynamic web page that uses infinite scrolling to fetch data, much like Facebook's news feed. I used the following statements to simulate scrolling down:

webclient.setJavaScriptEnabled(true);
webclient.setAjaxController(new NicelyResynchronizingAjaxController());
ScriptResult sr=myHtmlPage.executeJavaScript("window.scrollBy(0,600)");
webclient.waitForBackgroundJavaScript(10000);
myHtmlPage=(HtmlPage)sr.getNewPage();

But myHtmlPage seems to stay the same as before, i.e., the new data is not appended to myHtmlPage, so I can only crawl the first few items on the page. Thanks for your help!

Recommended answer

I had a similar problem where the content was post-loaded during page scrolling. I solved it using:

webClient.getCurrentWindow().setInnerHeight(Integer.MAX_VALUE);
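
If a single resize does not pull in everything, one possible complement is to keep triggering the load step until the number of items stops growing. The sketch below is not part of the original answer and does not call HtmlUnit at all: `ScrollLoop`, `scrollUntilStable`, `scrollStep`, `countItems`, and `maxRounds` are hypothetical names, and the "page" in `main` is a fake counter. In a real crawl, `scrollStep` would presumably wrap the `executeJavaScript` and `waitForBackgroundJavaScript` calls from the question, and `countItems` would count the loaded entries on the page.

```java
import java.util.function.IntSupplier;

final class ScrollLoop {

    /**
     * Runs scrollStep repeatedly until countItems stops growing
     * or maxRounds is reached, then returns the last observed count.
     */
    static int scrollUntilStable(Runnable scrollStep, IntSupplier countItems, int maxRounds) {
        int previous = countItems.getAsInt();
        for (int round = 0; round < maxRounds; round++) {
            scrollStep.run();                   // hypothetical: scroll and wait for background scripts
            int current = countItems.getAsInt();
            if (current <= previous) {
                return current;                 // nothing new was loaded, so stop
            }
            previous = current;
        }
        return previous;                        // gave up after maxRounds
    }

    public static void main(String[] args) {
        // Fake page: starts with 3 items, each "scroll" loads 3 more, capped at 9.
        int[] items = {3};
        int total = scrollUntilStable(
                () -> { items[0] = Math.min(items[0] + 3, 9); },
                () -> items[0],
                20);
        System.out.println(total); // prints 9
    }
}
```

The `maxRounds` cap is there so that a page that never stops loading cannot keep the crawler in the loop forever.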
