How do I scrape information off ASP.NET websites when paging and JavaScript links are being used?


Question


I have been given a staff list which is supposed to be up to date but it doesn't match an intranet People Finder which is written in ASP.NET.

As the information is sensitive, I am not able to access the database the People Finder uses, so the only way I can get at the information is by scraping the structure, starting with the top brass and then going through each tier in turn.

Each person has a staff number, which forms the URL http://intranet/peoplefinder/index.aspx?srn=ABC1234. All the people who report to them are listed underneath in the format <a id="gvEmployees_ctl03_lnkFullName" href="index.aspx?srn=ABC4321" target="_self">, where each URL indicates the staff number and provides a link to that person's team.

The trouble arises when teams are big, as paging is implemented in the GridView with a URL such as <a href="javascript:__doPostBack('gvEmployees','Page$2')">2</a>.

How would I scrape this page, capture the SRN and other details along with the people who report to the person on all pages of the GridView, and then loop through each reportee and repeat the process until the whole list is complete?

Example HTML of result

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head><title>
    People Finder: Name Surname
</title><link rel="stylesheet" href="/path/to/style.css" type="text/css" /><link rel="stylesheet" href="/path/to/anotherStyle.css" type="text/css" />
    <script type="text/javascript" src="/path/to/peoplefinder.js"></script>
</head>
<body>
    <form name="form1" method="post" action="/path/to/index.aspx" id="form1">
<div>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="### ViewState ###" />
</div>

<script type="text/javascript">
<!--
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
// -->
</script>


<script src="/path/to/WebResource.axd?d=AueXWrgAf8xSxMTAt1Q4AA2&amp;t=633311832634916698" type="text/javascript"></script>

        <div class="HP3CHeader">
            <div id="LWHPBanner">
                <h1><span id="lblName">Name Surname</span></h1>
            </div>
        </div>

        <div id='CPMain'>
            <div id="mainBox">

            <div id="pnlEmployeeDetails">

                <div id='basicData'>
                    <img id="imgPhoto" class="photo" src="/path/to/photo.jpg" style="height:69px;width:69px;border-width:0px;" />
                    <span id="lblBusinessUnit">Business Unit</span>
                    <span id="lblCostCentreName">Cost Centre</span>
                    <span id="lblLocation">Location</span>

                    <a href='/path/to/checkcontactdetails.htm' target='_blank' onclick='return OpenCheckContactDetails();' >Find out how to change your details/photo.</a>
                    <div id="manager">
        <strong>Reports to: </strong><a id="hlManager" href="/path/to/index.aspx?srn=ABC1234">Name Surname</a>
    </div>
                </div>

                <div id='contactData'>

                    <div id="pnlSrn">
        <strong>Staff number:</strong> <span id="lblSrn">ABC1234</span>
    </div>


                    <div id="pnlEmailAddress">
        <strong>Email Address:</strong> <span id="lblEmailAddress">Email</span>
    </div>
                    <div style="clear: both"></div>
                </div>

</div>

            <div id="pnlGrid">

                <h3><span id="lblGridTitle">Name's team</span></h3>
            <div>
        <table class="subordinates" cellspacing="0" cellpadding="2" rules="cols" border="1" id="gvEmployees" style="border-style:None;border-collapse:collapse;">
            <tr style="color:Black;background-color:#EFF3FB;border-style:None;font-weight:bold;">
                <th scope="col"><a href="javascript:__doPostBack('gvEmployees','Sort$SRN')" style="color:Black;">SRN</a></th><th scope="col"><a href="javascript:__doPostBack('gvEmployees','Sort$FullName')" style="color:Black;">Full name</a></th><th scope="col"><a href="javascript:__doPostBack('gvEmployees','Sort$RACFID')" style="color:Black;">RACFID</a></th>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl02_lnkFullName" href="index.aspx?srn=1K5932" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl03_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl04_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl05_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl06_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl07_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl08_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl09_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl10_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl11_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="PagerStyle" style="color:#000039;border-style:None;">
                <td colspan="3"><table border="0">
                    <tr>
                        <td><span>1</span></td><td><a href="javascript:__doPostBack('gvEmployees','Page$2')" style="color:#000039;">2</a></td>
                    </tr>
                </table></td>
            </tr>
        </table>
    </div>

</div>
            </div>

            <div id="searchBox">
                <strong>Search People Finder:</strong>
                <br /><br />
                <span>Forename:</span><br/>
                <span><input name="txtFirstname" type="text" id="txtFirstname" /></span><br/>
                <span>Surname:</span><br/>
                <span><input name="txtSurname" type="text" id="txtSurname" /></span><br/>
                <span>RACFID:</span><br/>
                <span><input name="txtRacfid" type="text" id="txtRacfid" /></span><br/>
                <span>Staff number:</span><br/>
                <span><input name="txtSrn" type="text" id="txtSrn" /></span><br/>
                <div class="searchBoxItem" style="text-align:center;width:100%"><input type="submit" name="btnFind" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;btnFind&quot;, &quot;&quot;, false, &quot;&quot;, &quot;index.aspx&quot;, false, false))" id="btnFind" title="Search for employees member" class="button" style="border-style:Outset;" /></div><br/> 
                <div>People Finder searches only UK staff.</div> 
               <!-- <div><a class="execBoardLink" href="/path/to/index.aspx?srn=ABC1234">Show Executive Board</a></div> -->
                <div style="margin-top:5px;"><a href="/path/to/phonebook" target="phoneBook" onclick='return OpenPhonebook();' title="Open Phonebook in new window">Open Phonebook</a></div>
            </div>
        </div>

        <div class="contentFooter"  style="text-align:center;">
            <table width="100%" cellpadding="0" cellspacing="0" border="0" summary="Navigation layout table">
                <tr>
                    <td align="left"><span class="linkArrow">&lt;</span> <a href="javascript:history.back();">Back</a></td>
                    <td align="center"></td>
                    <td align="right"><span class="linkArrow">^ </span><a href="#top">Top</a></td>
                </tr>
            </table>
        </div> 

<div>

    <input type="hidden" name="__PREVIOUSPAGE" id="__PREVIOUSPAGE" value="vy066Txz34y1E515UsTSTDabHKEmdBRCsq7xM0lpJls1" />
    <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWCgKM3uTTAgLP/83pDwLfwaTTAQKNguzjCAKt98LeCwLZh62pDwKKqdGpBwLd2q7jAwKa+5aMBAL5zb65C42zY4GBEUKujhjtZ/hZ8sLESfiF" />
</div></form>
</body>
</html>

Solution

You could POST the paging variables to the page to step through the GridView pages.

using System.IO;
using System.Net;
using System.Text;
using System.Web;

string lcUrl = "http://www.mysite.com/page.aspx";
HttpWebRequest loHttp = (HttpWebRequest) WebRequest.Create(lcUrl);

// *** Send any POST data
string lcPostData = "gvEmployees=" + HttpUtility.UrlEncode("Page$2");

loHttp.Method = "POST";
// Form posts need this content type so the server parses the body as form fields
loHttp.ContentType = "application/x-www-form-urlencoded";

byte[] lbPostBuffer = Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;

// Write the encoded form data into the request body
Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();

// Read the whole response into a string
HttpWebResponse loWebResponse = (HttpWebResponse) loHttp.GetResponse();
Encoding enc = Encoding.GetEncoding(1252);
StreamReader loResponseStream = new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();

loResponseStream.Close();
loWebResponse.Close();

Then parse out the data you need from the string.
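
As a rough illustration of that parsing step, the sketch below pulls the reportee SRNs and the pager arguments out of the returned HTML with regular expressions. The class and method names (PeopleFinderParser, GetReportSrns, GetPageArguments) are made up for this example, and the patterns assume the attribute order and quoting shown in the sample HTML in the question; an HTML parser would cope better if the markup varies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the original answer.
public static class PeopleFinderParser
{
    // Matches the SRN in links such as
    // <a id="gvEmployees_ctl03_lnkFullName" href="index.aspx?srn=ABC4321" ...>
    private static readonly Regex ReportLink = new Regex(
        @"<a\s+id=""gvEmployees_ctl\d+_lnkFullName""\s+href=""[^""]*[?&]srn=([^""&]+)""",
        RegexOptions.IgnoreCase);

    // Matches pager links such as javascript:__doPostBack('gvEmployees','Page$2')
    private static readonly Regex PagerLink = new Regex(
        @"__doPostBack\('gvEmployees','(Page\$\d+)'\)");

    // Returns the SRN of every reportee listed on this page of the grid.
    public static List<string> GetReportSrns(string html)
    {
        var srns = new List<string>();
        foreach (Match m in ReportLink.Matches(html))
            srns.Add(m.Groups[1].Value);
        return srns;
    }

    // Returns the event arguments ("Page$2", "Page$3", ...) of the other grid pages.
    public static List<string> GetPageArguments(string html)
    {
        var pages = new List<string>();
        foreach (Match m in PagerLink.Matches(html))
            pages.Add(m.Groups[1].Value);
        return pages;
    }
}
```

Each SRN returned by GetReportSrns can be queued as the next index.aspx?srn=... page to visit, and each value from GetPageArguments is what would be sent as the event argument to request a further grid page.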

--EDIT--

Here is what I would try (something similar) where all of the post data is sent:

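// Each field needs "name=value", and the pairs are joined with "&".
// Pass the raw "Page$2" to UrlEncode; passing "Page%242" would encode it twice.
string lcPostData =
    "__EVENTTARGET=" + HttpUtility.UrlEncode("gvEmployees") +
    "&__EVENTARGUMENT=" + HttpUtility.UrlEncode("Page$2") +
    "&__VIEWSTATE=" + HttpUtility.UrlEncode("<Value of __VIEWSTATE>") +
    // The sample page also emits __EVENTVALIDATION; send it back as well
    "&__EVENTVALIDATION=" + HttpUtility.UrlEncode("<Value of __EVENTVALIDATION>");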

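One possible way to fill in those hidden values is to read them from the page fetched just before the postback. The sketch below is only an illustration under assumptions: PostBackBuilder, GetHiddenFields and BuildPostData are hypothetical names, and the regular expression relies on the hidden inputs having the type, name, id, value attribute order seen in the sample HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the original answer.
public static class PostBackBuilder
{
    // Matches <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="..." />
    private static readonly Regex HiddenInput = new Regex(
        @"<input\s+type=""hidden""\s+name=""([^""]+)""\s+id=""[^""]*""\s+value=""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Collects every hidden field (name -> value) from the previously fetched page.
    public static Dictionary<string, string> GetHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in HiddenInput.Matches(html))
            fields[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        return fields;
    }

    // Builds a form-encoded body that mimics __doPostBack(eventTarget, eventArgument).
    public static string BuildPostData(string html, string eventTarget, string eventArgument)
    {
        var fields = GetHiddenFields(html);
        fields["__EVENTTARGET"] = eventTarget;
        fields["__EVENTARGUMENT"] = eventArgument;

        var pairs = new List<string>();
        foreach (var field in fields)
            pairs.Add(WebUtility.UrlEncode(field.Key) + "=" + WebUtility.UrlEncode(field.Value));
        return string.Join("&", pairs);
    }
}
```

Calling PostBackBuilder.BuildPostData(lcHtml, "gvEmployees", "Page$2") would then give a string that can be used as lcPostData in the request code above. Because the hidden values change with every response, they would need to be re-read from each page before requesting the next one.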