Screen scraping from a web page with a lot of JavaScript

Question

I have been asked to write an app that screen-scrapes information from an intranet web page and presents certain parts of it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.

Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.

Anyway, while looking into this I discovered that the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know of a library or program that I could feed the JavaScript to and that would then spit out the DOM for my app to parse?

I can pretty much write the app in any language, but my preference would be JavaFX, just so I could have a play with it.

Thanks for your time.

Ian

Recommended answer

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not perfect right now, but if your pages don't use many hacks, things should work fine, and performance should be much better than controlling a browser. Furthermore, you don't have to worry about cookies persisting after your scraping is over, or about all the other nasty things connected to controlling a browser (history, autocomplete, temp files, etc.).
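Once a headless browser such as HtmlUnit has run the page's scripts, the remaining step is to pull the order data out of the resulting DOM. The following is only a sketch of that step, using the XPath support in the Java standard library. The table markup, the `orders` id, the `id` and `status` class names, and the intranet URL in the comment are all hypothetical, since the real page's structure is not known. Well-formed XML input is also assumed.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class OrderStatusScraper {

    // Hypothetical markup standing in for the rendered intranet page.
    // With HtmlUnit on the classpath, the string would instead come from
    // something like:
    //   try (WebClient client = new WebClient()) {
    //       HtmlPage page = client.getPage("http://intranet.example/orders"); // assumed URL
    //       String xhtml = page.asXml();
    //   }
    static final String SAMPLE =
        "<html><body><table id=\"orders\">"
        + "<tr><td class=\"id\">1001</td><td class=\"status\">Arrived</td></tr>"
        + "<tr><td class=\"id\">1002</td><td class=\"status\">Receipted</td></tr>"
        + "</table></body></html>";

    // Parses well-formed XHTML and returns order id -> status, in document order.
    static Map<String, String> extractStatuses(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate(
            "//table[@id='orders']/tr", doc, XPathConstants.NODESET);

        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < rows.getLength(); i++) {
            // Relative expressions are evaluated against the current row.
            String id = xpath.evaluate("td[@class='id']", rows.item(i));
            String status = xpath.evaluate("td[@class='status']", rows.item(i));
            result.put(id.trim(), status.trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        extractStatuses(SAMPLE)
            .forEach((id, status) -> System.out.println(id + ": " + status));
    }
}
```

Running `main` on the sample markup prints `1001: Arrived` and `1002: Receipted`. A JavaFX front end could then show the returned map in a single list or table view.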
