Screen scraping from a web page with a lot of JavaScript


Question

I have been asked to write an app which screen scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating, to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.

Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.

Anyway, while looking into this I discovered that the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know of a library or program that I could feed the JavaScript to, and which would then spit out the DOM for my app to parse?

I can pretty much write the app in any language, but my preference would be JavaFX, just so I could have a play with it.

Thanks for your time.

Ian

Recommended answer

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not really perfect right now, but if your pages don't use many hacks, things should work fine, and the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies being persisted after your scraping is over, or about all the other nasty things connected to controlling a browser (history, autocomplete, temp files, etc.).
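
As a rough sketch of the "DOM for my app to parse" step: assuming HtmlUnit has already loaded the page and run its scripts, and the resulting DOM has been serialized to well-formed XML (for example with `HtmlPage.asXml()`), the order rows could be pulled out with the JDK's own XPath support. The class name `OrderStatusParser`, the table id `orders`, and the two-column layout are hypothetical stand-ins for whatever the real intranet page contains; the HtmlUnit call itself appears only as a comment, so the example runs with the JDK alone.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OrderStatusParser {

    // Returns "orderId: status" strings from a rendered page serialized as XML.
    // Assumption: rows live in a table with id "orders" (hypothetical), with the
    // order id in the first cell and the status in the second.
    public static List<String> extractStatuses(String renderedXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(renderedXml)));

        // Select only rows that contain data cells, which skips the header row.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate(
                "//table[@id='orders']//tr[td]", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            NodeList cells = row.getElementsByTagName("td");
            if (cells.getLength() >= 2) {
                String orderId = cells.item(0).getTextContent().trim();
                String status = cells.item(1).getTextContent().trim();
                result.add(orderId + ": " + status);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // In a real run the XML would come from HtmlUnit, roughly:
        //   HtmlPage page = webClient.getPage(url);  String xml = page.asXml();
        // Here a hand-written sample stands in for the rendered page.
        String xml = "<html><body><table id=\"orders\">"
                + "<tr><th>Order</th><th>Status</th></tr>"
                + "<tr><td>A100</td><td>arrived</td></tr>"
                + "<tr><td>A101</td><td>receipted</td></tr>"
                + "</table></body></html>";
        for (String line : extractStatuses(xml)) {
            System.out.println(line);
        }
    }
}
```

Keeping the extraction separate from the page loading means the XPath expressions can be tuned against a saved copy of the rendered page, without hitting the intranet site on every run.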
