Scrape website with javascript using cURL


Question

I am trying to scrape data from this website: http://ntthnue.edu.vn/tracuudiem

When I fill in the SBD field with 'TS4740' in the browser, I get the result successfully. However, the same lookup does not work when I run this PHP cURL code:

<?php

function getData($id) {
    $url = 'http://ntthnue.edu.vn/tracuudiem';
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['sbd' => $id]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);

    curl_close($ch);

    return $result;
}

echo getData('TS4740');

I just get the old page back instead of the result. Can anybody explain why? Thank you!

Recommended answer

Make sure you send all the necessary headers and input data. The server that processes this request can run all kinds of checks to decide whether it is a "valid" form request. As such, you need to spoof the request so that it is as close to a regular browser request as possible.

Use a tool like Chrome Dev Tools to inspect both the request and response headers exchanged between your browser and the server, so you can see what your cURL setup should look like. An app like Postman also makes it easy to simulate the request and find out what works and what does not.

<?php

function getData($id) {
    $url = 'http://ntthnue.edu.vn/tracuudiem';
    $ch = curl_init($url);
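    // Send every field the browser form submits, not only 'sbd'. The fixed values are
    // already URL-encoded; urlencode() keeps a custom $id from breaking the body.
    $postdata = 'namhoc=2015-2016&kythi_name=Tuy%E1%BB%83n+sinh+v%C3%A0o+l%E1%BB%9Bp+10&hoten=&sbd='.urlencode($id).'&btnSearch=T%C3%ACm+ki%E1%BA%BFm';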
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Origin: http://ntthnue.edu.vn',
        'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36',
        'Content-Type: application/x-www-form-urlencoded',
        'Referer: http://ntthnue.edu.vn/tracuudiem',
    ));

    $result = curl_exec($ch);

    curl_close($ch);

    return $result;
}

echo getData('TS4740');
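
As a variation on the code above, the POST body can be built from readable values instead of a hand-encoded string. This is only a sketch: `buildPostBody` is a hypothetical helper name, and the decoded field values are simply what the percent-encoded string in the answer decodes to, so they are assumed to match what the form currently submits.

```php
<?php

// Hypothetical helper: builds the same urlencoded body as the hand-written
// $postdata string in the answer, from readable UTF-8 values.
function buildPostBody($id) {
    $fields = [
        'namhoc'     => '2015-2016',
        'kythi_name' => 'Tuyển sinh vào lớp 10',
        'hoten'      => '',
        'sbd'        => $id,
        'btnSearch'  => 'Tìm kiếm',
    ];

    // http_build_query() percent-encodes each value and joins the pairs with '&'.
    return http_build_query($fields);
}

echo buildPostBody('TS4740'), PHP_EOL;
```

Passing the resulting string to CURLOPT_POSTFIELDS sends the body as application/x-www-form-urlencoded. Passing a PHP array instead, as the code in the question does, makes cURL send multipart/form-data, which is one more way a request can differ from what the browser sends.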
