How to scrape website content (*COMPLEX* iframe, javascript submission)


Question

I've done web scraping before, but never anything this complex. I want to grab course information from a school website, but the way it is displayed is a web scraper's nightmare.

First off, when you click the "Schedule of Classes" URL, it directs you through several other pages first (I believe to set cookies and run other checks).

Then it finally loads a page with an iframe that apparently only loads when it is embedded in one of the institution's own pages (i.e. arizona.edu).

From there, form submissions have to be made via buttons that don't actually reload the page but merely submit an AJAX query, which I think just manipulates the iframe.

This query is particularly hard for me to replicate. I've been using PHP and curl to simulate a browser visiting the initial page, gathering the proper cookies and such. But I think there is a problem with the headers my curl function is sending, because it never lets me execute any sort of query after the initial "search form" loads.
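One possible cause of this symptom is that the follow-up request goes out without the cookies or headers a browser would send. Here is a minimal sketch of that idea, written in Python with only the standard library rather than PHP. The form field, the `X-Requested-With` header, and the idea that the server checks `Referer` are assumptions for illustration. The real values have to be copied from the browser's network inspector.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that keeps cookies across requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_ajax_request(url, form, referer):
    """Build a POST that resembles a browser's AJAX form submission."""
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes this a POST
        headers={
            "User-Agent": "Mozilla/5.0",
            # Assumption: the server may check where the request came from.
            "Referer": referer,
            # Assumption: many AJAX libraries send this header.
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def fetch_results(entry_url, ajax_url, form):
    """Visit the entry page to collect cookies, then send the AJAX query."""
    opener, _jar = make_opener()
    # Redirects are followed automatically; cookies set on the way are stored.
    opener.open(entry_url, timeout=30).close()
    request = build_ajax_request(ajax_url, form, referer=entry_url)
    with opener.open(request, timeout=30) as response:
        return response.read()
```

A call might look like `fetch_results("http://schedule.arizona.edu/", ajax_url, {"subject": "MATH"})`, where `ajax_url` and the `subject` field are hypothetical placeholders. If the page builds its request parameters in JavaScript, this approach may still fail.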

Any help would be awesome...

http://www.arizona.edu/students/registering-classes -> Schedule of Classes

Or just here:
http://schedule.arizona.edu/

Recommended answer

If you need to scrape a site with heavy JS/AJAX usage, you need something more powerful than PHP ;)

First, it must be a full browser capable of executing JS; second, there must be some API for automated browsing.

Assuming that you are a kid (who else would need to parse a school site), try Firefox with iMacros. If you are a more seasoned veteran, look towards Selenium.
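As a rough illustration of the Selenium route, here is a sketch using Selenium's Python bindings. It drives a real Firefox instance, enters the iframe, clicks the search button, and waits for the results the AJAX call inserts. The CSS selectors are hypothetical parameters, and treating the results as an HTML table is an assumption. Neither is taken from the actual site.

```python
from html.parser import HTMLParser


def fetch_results_html(url, button_css, result_css, timeout=20):
    """Open the page in Firefox, run the search in the iframe, return its HTML."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Switch into the first iframe once the browser has loaded it.
        wait.until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe")))
        # Click the button that fires the AJAX query (selector is hypothetical).
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, button_css))).click()
        # Wait until the results appear, then take the current frame's HTML.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, result_css)))
        return driver.page_source
    finally:
        driver.quit()


class _TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_rows(html):
    """Return table cells as a list of rows (assumes results are in a table)."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows if row]
```

Because the browser itself follows the redirects, stores the cookies, and executes the JavaScript, none of the header guesswork from the curl approach is needed. The cost is that every run has to start a full browser.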
