Scraping an AngularJS application

Question

I'm scraping some HTML pages with Rails, using Nokogiri.

I had some problems when I tried to scrape an AngularJS page, because the gem opens the HTML before it has been fully rendered.

Is there some way to scrape this type of page? How can I have the page fully rendered before scraping it?

Recommended answer

If you're trying to scrape AngularJS pages in a fully generic fashion, you will likely need something like what @tadman mentioned in the comments (PhantomJS): some type of headless browser that fully processes the AngularJS JavaScript and exposes the resulting DOM for inspection afterwards.
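
The headless-browser approach can be sketched roughly as follows. PhantomJS is no longer maintained, so this sketch assumes the selenium-webdriver gem with headless Chrome instead; that substitution is mine, not part of the original answer. The helper names (`rendered_html`, `headless_chrome_driver`), the URL, and the `.item` selector are hypothetical. The idea is to load the page, poll until an element that Angular renders is present, and only then hand the HTML to Nokogiri.

```ruby
# Loads `url` in a browser driver and waits until at least one element
# matching `ready_selector` exists, i.e. until Angular has rendered content.
# `driver` is any object that responds to get, find_elements(css:) and
# page_source, such as a Selenium::WebDriver driver.
def rendered_html(driver, url, ready_selector:, timeout: 10, interval: 0.25)
  driver.get(url)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  until driver.find_elements(css: ready_selector).any?
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "Timed out waiting for #{ready_selector} on #{url}"
    end
    sleep interval
  end

  driver.page_source
end

# Builds a headless Chrome driver. Assumes the selenium-webdriver gem and a
# Chrome installation are available; the gem is required lazily so the
# helper above can be used without it.
def headless_chrome_driver
  require 'selenium-webdriver'
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  Selenium::WebDriver.for(:chrome, options: options)
end

# Usage (hypothetical URL and selector):
#   driver = headless_chrome_driver
#   html   = rendered_html(driver, 'https://example.com/#/items', ready_selector: '.item')
#   doc    = Nokogiri::HTML(html)   # needs: require 'nokogiri'
#   driver.quit
```

Waiting for a specific selector rather than sleeping for a fixed time keeps the scrape from reading the page too early, which is the problem described in the question.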

If you have a specific site or sites that you are looking to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code pulls its content. The standard scenario for many AngularJS sites is that they pull down the static JS and HTML code/templates, and then make Ajax calls back to a server (either their own, or some third-party API) to get the content that will be rendered. If you take a look at their code, you can likely query directly whatever Angular is calling (e.g. via $http, ngResource, or Restangular). The returned data is typically JSON and is much easier to collect than true scraping of the post-rendered HTML.
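
As a minimal sketch of querying the backing API directly, using only Ruby's standard library: the endpoint URL and the JSON shape (an `items` array whose entries have `title` and `price`) are assumptions for illustration. The real endpoint and fields have to be read from the site's own Angular code or from the browser's network tab.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetches the raw JSON body from the endpoint the Angular app calls.
def fetch_json(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Pulls the fields of interest out of the payload.
# Assumes a hypothetical shape: {"items": [{"title": "...", "price": 1.5}]}
def extract_items(json_body)
  data = JSON.parse(json_body)
  data.fetch('items', []).map do |item|
    { title: item['title'], price: item['price'] }
  end
end

# Usage (hypothetical endpoint):
#   body  = fetch_json('https://example.com/api/items')
#   items = extract_items(body)
```

Keeping the fetch and the extraction separate means the parsing logic can be checked against a saved response without hitting the server.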
