Scraping data with POST data in PHP from ASP.net

Question

I'm trying to scrape a site that expects POST data, but I can't see how to send the request and get the results back.

The URL is

https://www.vehicleenquiry.service.gov.uk/Default.aspx

with the values R23CCP as the registration and CITROEN as the make.

But I am used to URLs like http://website.org/?make=citreon&reg=r23ccp. How can I do this in PHP and fetch the data back? Would I need to use cURL?

I am fetching from an ASP.NET (.aspx) site, so I have the feeling the failure could be down to __VIEWSTATE or __EVENTVALIDATION, but I have no experience with ASP.NET.

So far I have:

<?php

$url = 'https://www.vehicleenquiry.service.gov.uk/Default.aspx';
$fields = array(
    '__VIEWSTATE' => 'CA0B0334',
    '__EVENTVALIDATION' => '/wEdAAefFp68E7VL+HYOzuRuFWkhBOwywjxOOgpEYFN2beEgnftoCCZcWJSqSRLD/FKuxxkI0x5r4gPeKgWgSNWptTEWInv2PXI3Jzdn3U6eHDG4Qb7lltCXTdtnDbitYujbDJI0GQSIMiv32DreL6oRbYpQ3k06XH1tmJDb9ukYqsCJMjXcVuE=',
    'ctl00$MainContent$txtSearchVrm' => 'R23CCP',
    'ctl00$MainContent$MakeTextBox' => 'CITROEN',
    'ctl00$MainContent$butSearch' => 'Search'
);

$fields_string = http_build_query($fields);

$curl = curl_init($url);

curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => 0,
    CURLOPT_SSL_VERIFYHOST => 0,
    CURLOPT_HTTPHEADER     => array(
        'Content-type: application/x-www-form-urlencoded; charset=utf-8',
        'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
    ),
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $fields_string,
    CURLOPT_FOLLOWLOCATION => 1
));

$response = curl_exec($curl);
curl_close($curl);

echo $response;

?>

Recommended answer

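They made this a bit more complicated: you actually have to make two requests. The first, an empty GET, gives you a session cookie and the hidden __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION values; the second POSTs those values back together with your search fields. The data you want is in $html after the second hhb_curl_exec2() call, enjoy: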

<?php





$registration_number='R23CCP';
$vehicle_maker='CITROEN';


$ch=hhb_curl_init();

$debugHeaders=array();
$debugCookies=array();
$debugRequest='';

$html=hhb_curl_exec2($ch,'https://www.vehicleenquiry.service.gov.uk/Default.aspx',$debugHeaders,$debugCookies,$debugRequest);
//first do an empty request to get a session id and cookies and the weird VIEWSTATE stuff...
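//loadHTML() is an instance method; calling it statically is an error on PHP 8
$domd=new DOMDocument();
if(!@$domd->loadHTML($html)){
    throw new RuntimeException('could not parse the HTML of the first response');
}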
$__VIEWSTATE=$domd->getElementById('__VIEWSTATE')->getAttribute('value');
$__VIEWSTATEGENERATOR=$domd->getElementById('__VIEWSTATEGENERATOR')->getAttribute('value');
$__EVENTVALIDATION=$domd->getElementById('__EVENTVALIDATION')->getAttribute('value');

var_dump('__VIEWSTATE:',$__VIEWSTATE,'__VIEWSTATEGENERATOR:',$__VIEWSTATEGENERATOR,'__EVENTVALIDATION:',$__EVENTVALIDATION,'headers:',$debugHeaders,'cookies:',$debugCookies,'html:',$html,'request:',$debugRequest,'domd:',$domd);

//now to get the POST stuff
curl_setopt_array($ch,array(
    CURLOPT_POST=>true,
    CURLOPT_POSTFIELDS=>http_build_query(array(
        '__LASTFOCUS'=>'',
        '__EVENTTARGET'=>'',
        '__VIEWSTATE'=>$__VIEWSTATE,
        '__VIEWSTATEGENERATOR'=>$__VIEWSTATEGENERATOR,
        '__EVENTVALIDATION'=>$__EVENTVALIDATION,
        'ctl00$MainContent$txtSearchVrm'=>$registration_number,
        'ctl00$MainContent$MakeTextBox'=>$vehicle_maker,
        'ctl00$MainContent$txtV5CDocumentReferenceNumber'=>'',
        'ctl00$MainContent$butSearch'=>'Search',
    ))
));
$html=hhb_curl_exec2($ch,'https://www.vehicleenquiry.service.gov.uk/Default.aspx',$debugHeaders,$debugCookies,$debugRequest);
var_dump('headers:',$debugHeaders,'cookies:',$debugCookies,'html:',$html,'request:',$debugRequest);


function hhb_curl_init($custom_options_array = array())
{
    if (empty($custom_options_array)) {
        $custom_options_array = array();
        //i feel kinda bad about this.. argv[1] of curl_init wants a string(url), or NULL
        //at least i want to allow NULL aswell :/
    }
    if (!is_array($custom_options_array)) {
        throw new InvalidArgumentException('$custom_options_array must be an array!');
    }
    ;
    $options_array = array(
        CURLOPT_AUTOREFERER => true,
        CURLOPT_BINARYTRANSFER => true,
        CURLOPT_COOKIESESSION => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_FORBID_REUSE => false,
        CURLOPT_HTTPGET => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT => 11,
        CURLOPT_ENCODING => ""
        //CURLOPT_REFERER=>'example.org',
        //CURLOPT_USERAGENT=>'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0'
    );
    if (!array_key_exists(CURLOPT_COOKIEFILE, $custom_options_array)) {
        //do this only conditionally because tmpfile() call..
        static $curl_cookiefiles_arr = array(); //workaround for https://bugs.php.net/bug.php?id=66014
        $curl_cookiefiles_arr[]            = $options_array[CURLOPT_COOKIEFILE] = tmpfile();
        $options_array[CURLOPT_COOKIEFILE] = stream_get_meta_data($options_array[CURLOPT_COOKIEFILE]);
        $options_array[CURLOPT_COOKIEFILE] = $options_array[CURLOPT_COOKIEFILE]['uri'];

    }
    //we can't use array_merge() because of how it handles integer-keys, it would/could cause corruption
    foreach ($custom_options_array as $key => $val) {
        $options_array[$key] = $val;
    }
    unset($key, $val, $custom_options_array);
    $curl = curl_init();
    curl_setopt_array($curl, $options_array);
    return $curl;
}
function hhb_curl_exec($ch, $url)
{
    static $hhb_curl_domainCache = "";
    //$hhb_curl_domainCache=&$this->hhb_curl_domainCache;
    //$ch=&$this->curlh;
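    //PHP 8+ returns a CurlHandle object from curl_init(), older versions return a resource
    if (!($ch instanceof CurlHandle) && !(is_resource($ch) && get_resource_type($ch) === 'curl')) {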
        throw new InvalidArgumentException('$ch must be a curl handle!');
    }
    if (!is_string($url)) {
        throw new InvalidArgumentException('$url must be a string!');
    }

    $tmpvar = "";
    if (parse_url($url, PHP_URL_HOST) === null) {
        if (substr($url, 0, 1) !== '/') {
            $url = $hhb_curl_domainCache . '/' . $url;
        } else {
            $url = $hhb_curl_domainCache . $url;
        }
    }
    ;

    curl_setopt($ch, CURLOPT_URL, $url);
    $html = curl_exec($ch);
    if (curl_errno($ch)) {
        throw new Exception('Curl error (curl_errno=' . curl_errno($ch) . ') on url ' . var_export($url, true) . ': ' . curl_error($ch));
        // echo 'Curl error: ' . curl_error($ch);
    }
    if ($html === '' && 204 != ($tmpvar = curl_getinfo($ch, CURLINFO_HTTP_CODE)) /*204 is "No Content": success, but no output..*/ ) {
        throw new Exception('Curl returned nothing for ' . var_export($url, true) . ' but HTTP_RESPONSE_CODE was ' . var_export($tmpvar, true));
    }
    ;
    //remember that curl (usually) auto-follows the "Location: " http redirects..
    $hhb_curl_domainCache = parse_url(curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), PHP_URL_HOST);
    return $html;
}
function hhb_curl_exec2($ch, $url, &$returnHeaders = array(), &$returnCookies = array(), &$verboseDebugInfo = "")
{
    $returnHeaders    = array();
    $returnCookies    = array();
    $verboseDebugInfo = "";
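    //PHP 8+ returns a CurlHandle object from curl_init(), older versions return a resource
    if (!($ch instanceof CurlHandle) && !(is_resource($ch) && get_resource_type($ch) === 'curl')) {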
        throw new InvalidArgumentException('$ch must be a curl handle!');
    }
    if (!is_string($url)) {
        throw new InvalidArgumentException('$url must be a string!');
    }
    $verbosefileh = tmpfile();
    $verbosefile  = stream_get_meta_data($verbosefileh);
    $verbosefile  = $verbosefile['uri'];
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_STDERR, $verbosefileh);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $html             = hhb_curl_exec($ch, $url);
    $verboseDebugInfo = file_get_contents($verbosefile);
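    curl_setopt($ch, CURLOPT_VERBOSE, 0); //stop logging instead of passing NULL as CURLOPT_STDERR, which newer PHP versions reject; the next call sets a fresh file anyway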
    fclose($verbosefileh);
    unset($verbosefile, $verbosefileh);
    $headers       = array();
    $crlf          = "\x0d\x0a";
    $thepos        = strpos($html, $crlf . $crlf, 0);
    $headersString = substr($html, 0, $thepos);
    $headerArr     = explode($crlf, $headersString);
    $returnHeaders = $headerArr;
    unset($headersString, $headerArr);
    $htmlBody = substr($html, $thepos + 4); //should work on utf8/ascii headers... utf32? not so sure..
    unset($html);
    //I REALLY HOPE THERE EXIST A BETTER WAY TO GET COOKIES.. good grief this looks ugly..
    //at least it's tested and seems to work perfectly...
    $grabCookieName = function($str,&$len)
    {
        $len=0;
        $ret = "";
        $i   = 0;
        for ($i = 0; $i < strlen($str); ++$i) {
            ++$len;
            if ($str[$i] === ' ') {
                continue;
            }
            if ($str[$i] === '=') {
                --$len;
                break;
            }
            $ret .= $str[$i];
        }
        return urldecode($ret);
    };
    foreach ($returnHeaders as $header) {
        //Set-Cookie: crlfcoookielol=crlf+is%0D%0A+and+newline+is+%0D%0A+and+semicolon+is%3B+and+not+sure+what+else
        /*Set-Cookie:ci_spill=a%3A4%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22305d3d67b8016ca9661c3b032d4319df%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A14%3A%2285.164.158.128%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A109%3A%22Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F43.0.2357.132+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1436874639%3B%7Dcab1dd09f4eca466660e8a767856d013; expires=Tue, 14-Jul-2015 13:50:39 GMT; path=/
        Set-Cookie: sessionToken=abc123; Expires=Wed, 09 Jun 2021 10:18:14 GMT;
        //Cookie names cannot contain any of the following '=,; \t\r\n\013\014'
        //
        */
        if (stripos($header, "Set-Cookie:") !== 0) {
            continue;
            /**/
        }
        $header = trim(substr($header, strlen("Set-Cookie:")));
        $len=0;
        while (strlen($header) > 0) {
            $cookiename                 = $grabCookieName($header,$len);
            $returnCookies[$cookiename] = '';
            $header                     = substr($header, $len + 1); //also remove the = 
            if (strlen($header) < 1) {
                break;
            }
            ;
            $thepos = strpos($header, ';');
            if ($thepos === false) { //last cookie in this Set-Cookie.
                $returnCookies[$cookiename] = urldecode($header);
                break;
            }
            $returnCookies[$cookiename] = urldecode(substr($header, 0, $thepos));
            $header                     = trim(substr($header, $thepos + 1)); //also remove the ;
        }
    }
    unset($header, $cookiename, $thepos);
    return $htmlBody;
}
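
For readers who only need the core idea, the two-request flow described above can also be sketched with the built-in cURL functions alone. This is a minimal sketch, not the answer's code: the function names extract_hidden_fields() and webforms_post() are hypothetical, and it assumes the page keeps its state in lowercase type="hidden" inputs, as ASP.NET WebForms normally does. Instead of naming each hidden field, it copies all of them from the first response, so extra fields such as __VIEWSTATEGENERATOR are picked up automatically.

```php
<?php
// Sketch only: both function names are hypothetical helpers, not library APIs.

/**
 * Collect every <input type="hidden"> name/value pair from an HTML page.
 * Assumes lowercase type="hidden", which is what ASP.NET WebForms emits.
 */
function extract_hidden_fields(string $html): array
{
    if ($html === '') {
        return array(); // loadHTML() refuses an empty string
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    $xpath  = new DOMXPath($dom);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

/**
 * GET the form, then POST it back with the fresh hidden fields plus $userFields.
 * Returns the HTML of the result page.
 */
function webforms_post(string $url, array $userFields): string
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string: keep cookies in memory for this handle
        CURLOPT_ENCODING       => '', // accept any encoding cURL supports
    ));

    // Request 1: plain GET to obtain the session cookie and the hidden state fields.
    $formHtml = curl_exec($ch);
    if ($formHtml === false) {
        throw new RuntimeException('GET failed: ' . curl_error($ch));
    }

    // Request 2: POST on the same handle, so the cookies from request 1 are sent back.
    // With "+", values in $userFields win over hidden fields of the same name.
    $post = $userFields + extract_hidden_fields($formHtml);
    curl_setopt_array($ch, array(
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($post),
    ));
    $resultHtml = curl_exec($ch);
    if ($resultHtml === false) {
        throw new RuntimeException('POST failed: ' . curl_error($ch));
    }
    return $resultHtml;
}
```

With the field names from the answer, a call would pass the URL plus an array containing 'ctl00$MainContent$txtSearchVrm', 'ctl00$MainContent$MakeTextBox' and 'ctl00$MainContent$butSearch'. Whether the site still accepts this form today is not something the sketch can guarantee.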
