Set session to scrape page

Question

URL1: https://duapp3.drexel.edu/webtms_du/

URL2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX

URL3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX

As a personal programming project, I want to scrape my University's course catalog and provide it as a RESTful API.

However, I'm running into the following issue.

The page that I need to scrape is URL3. But URL3 only returns meaningful information after I visit URL2 (that is where the term gets set, via Colleges.asp?Term=201125), and URL2 can only be visited after visiting URL1.

I tried monitoring the HTTP data going to and fro using Fiddler and I don't think they are using cookies. Closing the browser instantly resets everything, so I suspect they are using Session.

How can I scrape URL 3? I tried, programatically, visiting URLs 1 and 2 first, and then doing file_get_contents(url3) but that doesn't work (probably because it registers as three different sessions.

Answer

A session needs a mechanism to identify you as well. Popular methods include cookies and a session ID in the URL.

A curl -v on URL 1 reveals a session cookie is indeed being set.

Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/

You need to send this cookie back to the server on any subsequent requests to keep your session alive.
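For illustration, a follow-up request to URL 2 carrying that cookie would look roughly like this on the wire (the cookie value is the one from the Set-Cookie header above and changes with every new session):

GET /webtms_du/Colleges.asp?Term=201125&univ=DREX HTTP/1.1
Host: duapp3.drexel.edu
Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA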

If you want to use file_get_contents, you need to manually create a context for it with stream_context_create in order to include cookies with the request.
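A minimal sketch of such a context, assuming you have already captured the cookie value from the Set-Cookie header of URL 1 (the value shown here is just the example from above):

// Build an HTTP context that attaches the session cookie to the request
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA\r\n"
    )
));

// Pass the context so file_get_contents sends the cookie along
$html = file_get_contents('https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX', false, $context);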

An alternative (which I would personally prefer) would be to use curl functions conveniently provided by PHP. (It can even take care of the cookie traffic for you!) But that's just my preference.

Here's a working example to scrape the path in your question.

$scrape = array(
    "https://duapp3.drexel.edu/webtms_du/",
    "https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX",
    "https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX"
);

$data = '';
$ch = curl_init();

// Set cookie jar to temporary file, because, even if we don't need them, 
// it seems curl does not store the cookies anywhere otherwise or include
// them in subsequent requests
curl_setopt($ch, CURLOPT_COOKIEJAR, tempnam(sys_get_temp_dir(), 'curl'));

// We don't want direct output by curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Then run along the scrape path
foreach ($scrape as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
}

curl_close($ch);

echo $data;
