Scraping data from all ASP.NET pages with AJAX pagination implemented

Problem description

I want to scrape a web page that contains a list of users with addresses, emails, and so on. The list is paginated: each page shows 10 users, and clicking the page 2 link loads the second page of users via AJAX and updates the list, and likewise for every other pagination link.

The website is built with ASP.NET (the pages have the .aspx extension). I don't know anything about ASP.NET or how it manages pagination and AJAX.

I am using Simple HTML DOM (http://sourceforge.net/projects/simplehtmldom/) to scrape the content.

For lists with 10 or fewer users there is no pagination, so I don't have to simulate the AJAX request that is sent when a user clicks a pagination link.

But for a paginated list, I simulate the AJAX POST request to get the data from the other pages:

require 'simple_html_dom.php';

$html = file_get_html('www.example.com/user_list.aspx');

$viewstate = $html->find("#__VIEWSTATE");
$viewstate = $viewstate[0]->attr['value'];

$eventvalidation        = $html->find("#__EVENTVALIDATION");
$eventvalidation        = $eventvalidation[0]->attr['value'];
$number_of_pageinations = 3;

$pageNumberCodes = array(
    'ctl00$cphMainContent$rdpMembers$ctl01$ctl01',
    'ctl00$cphMainContent$rdpMembers$ctl01$ctl02',
    'ctl00$cphMainContent$rdpMembers$ctl01$ctl03'
); // this code is added for each page in POST  as  __EVENTTARGET 

for ($i = 0; $i < $number_of_pageinations; $i++) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // return web page
        CURLOPT_HEADER => false, // don't return headers
        CURLOPT_ENCODING => "", // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7'", // who am i
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 1120, // timeout on response
        CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
        CURLOPT_POST => true,
        CURLOPT_VERBOSE => true,
        CURLOPT_POSTFIELDS => urlencode('ctl00%24scriptManager=ctl00%24cphMainContent%24ctl00%24cphMainContent%24rdpMembersPanel%7C' . $pageNumberCodes[0] . '&__EVENTTARGET=' . $pageNumberCodes[0] . '&__EVENTARGUMENT=' . '&__VIEWSTATE=' . $viewstate . '&__EVENTVALIDATION=' . $eventvalidation . "&google=" . '&ctl00%24cphMainContent%24txtZip=' . '&ctl00%24cphMainContent%24cboRadius=Exact' . '&ctl00%24cphMainContent%24txtMemberName=' . '&ctl00%24cphMainContent%24txtCity=Honolulu' . '&ctl00%24cphMainContent%24cboState=HI' . '&ctl00%24cphMainContent%24txtAddress=' . '&ctl00_cphMainContent_rdpMembers_ClientState=' . '&ctl00%24cphMainContent%24ddList=-Select%20field%20to%20sort-' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_rdpMembers1_ClientState=' . '&__ASYNCPOST=true' . 'RadAJAXControlID=ctl00_cphMainContent_RadAjaxManager1')
    );
    $ch      = curl_init($url);
    curl_setopt_array($ch, $options);
    $return = curl_exec($ch);
    curl_close($ch);
    echo $return;

    $newHtml = str_get_html($return);

    $viewstate = $newHtml->find("#__VIEWSTATE");
    $viewstate = $viewstate[0]->attr['value'];

    $eventvalidation = $newHtml->find("#__EVENTVALIDATION");
    $eventvalidation = $eventvalidation[0]->attr['value'];
}

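This should echo the data of the different pages, but it always prints the data of the first page. Can anyone point out where I am going wrong and what is missing?
I don't know how ASP.NET manages pagination and AJAX requests, or what __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION are.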

Recommended answer

In general, to make an ASP.NET web site think that you actually pressed a button (in more general terms, performed a postback), you need to do the following:

  1. Get the value of every INPUT and SELECT element on the page. This might not be required in every scenario, but you should always at least get the values of all hidden fields whose names start with "__" (such as __VIEWSTATE). You don't really need to know what they contain, only that their values have to be sent back to the server unchanged.

  2. Create a POST request to the server. You need to use a classic POST and avoid any AJAX requests. With some browser plugins (in Firefox or Chrome) it might be possible to disable XMLHttpRequest, so that you can then intercept the non-AJAX request with a tool like Fiddler.

  3. Add every value from #1 to that POST request. There are only two values you need to overwrite: __EVENTTARGET and __EVENTARGUMENT. Leave them empty unless the link or button you are trying to imitate has a handler like <a href="javascript:__doPostBack('ctl00$login','')">. In that case, parse the values from the link: the first is the event target (it will usually match the ID of some element on the page), and the second is the event argument.
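
The three steps above can be sketched in PHP roughly as follows. This is a minimal illustration, not a drop-in fix: it uses PHP's built-in DOMDocument instead of Simple HTML DOM so that it is self-contained, and the function names (collectFormFields, parseDoPostBack, buildPostbackBody) are hypothetical helpers invented for this example. It only builds the POST body; sending it is a separate step.

```php
<?php
// Hypothetical helpers illustrating the steps above; not part of any library.

/** Step 1: collect name => value for the INPUT and SELECT elements in the page. */
function collectFormFields(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world markup is rarely well-formed
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        $type = strtolower($input->getAttribute('type'));
        // Skip unnamed inputs and buttons: a browser only sends the button that was clicked.
        if ($name === '' || in_array($type, ['submit', 'button', 'image', 'reset'], true)) {
            continue;
        }
        // Unchecked checkboxes and radio buttons are not submitted by browsers.
        if (in_array($type, ['checkbox', 'radio'], true) && !$input->hasAttribute('checked')) {
            continue;
        }
        $fields[$name] = $input->getAttribute('value');
    }
    foreach ($doc->getElementsByTagName('select') as $select) {
        $name = $select->getAttribute('name');
        if ($name === '') {
            continue;
        }
        $value = '';
        $first = true;
        foreach ($select->getElementsByTagName('option') as $option) {
            // Default to the first option, then prefer one marked as selected.
            if ($first || $option->hasAttribute('selected')) {
                $value = $option->hasAttribute('value')
                    ? $option->getAttribute('value')
                    : $option->textContent;
                $first = false;
            }
        }
        $fields[$name] = $value;
    }
    return $fields;
}

/** Step 3: extract the event target and argument from a __doPostBack link. */
function parseDoPostBack(string $href): ?array
{
    if (preg_match("/__doPostBack\\(\\s*'([^']*)'\\s*,\\s*'([^']*)'\\s*\\)/", $href, $m)) {
        return ['target' => $m[1], 'argument' => $m[2]];
    }
    return null;
}

/** Steps 2 and 3: overwrite the two event fields and build a classic POST body. */
function buildPostbackBody(array $fields, string $target, string $argument = ''): string
{
    $fields['__EVENTTARGET'] = $target;
    $fields['__EVENTARGUMENT'] = $argument;
    // http_build_query URL-encodes every name and value exactly once.
    return http_build_query($fields, '', '&');
}
```

Building the body from an array with http_build_query avoids encoding the whole string a second time, and the fields are re-read from each response so that __VIEWSTATE and __EVENTVALIDATION stay in sync with the server.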

If you executed the request correctly, you should get back a full HTML page. If you get a partial response, check that you did not send the HTTP header that asks for an async result.
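
A rough way to send the request and tell the two kinds of response apart is sketched below. The function names are hypothetical. The check relies on an assumption about the response format: ASP.NET AJAX partial responses are, as far as I know, pipe-delimited text beginning with something like "<length>|<type>|<id>|" rather than HTML markup, so the test is only a heuristic. Likewise, the async request is usually signalled by an X-MicrosoftAjax header and an __ASYNCPOST=true form field; the sketch simply sends neither.

```php
<?php
// Hypothetical helpers; the delta-format check is a heuristic, not an official API.

/** Returns true if the body looks like a pipe-delimited partial (async) response. */
function looksLikeAsyncDelta(string $body): bool
{
    // Assumed shape: "<digits>|<type>|<id>|..." at the very start of the body.
    return preg_match('/^\s*\d+\|[^|]*\|[^|]*\|/', $body) === 1;
}

/** Sends a classic (non-AJAX) form POST and returns the response body. */
function postClassic(string $url, string $body): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body, // already URL-encoded, so pass it as a string
        // No X-MicrosoftAjax header is set, so the server should render a full page.
        CURLOPT_HTTPHEADER     => ['Content-Type: application/x-www-form-urlencoded'],
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $response;
}
```

If looksLikeAsyncDelta() returns true for a response, the hidden __VIEWSTATE and __EVENTVALIDATION inputs will not be found by an HTML parser, which would explain why a loop that re-reads them keeps requesting the same page.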
