How to scrape a web page that doesn't show its data?


Problem description

I want to scrape the following web page:

https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019

As you can see, there is a lot of data, yet when I view the page source, the following HTML is all there is for the data of interest. Where is all the data coming from? How can something be displayed that isn't in the HTML?

<div class="Head_W">
    <div tabindex="0"  tabindex="0"  class="Sub_Title">Auctions Waiting</div>
    <div   class="Fadebar"></div>
        <div tabindex="0"  class="PageFrame" area="W">
            <span class="PageLeft"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle"  /></span>
            <span tabindex="0" class="PageText">page <input id="curPWA" type="text" curPG="" />  of <span id="maxWA"></span> </span>
            <span class="PageRight"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
        </div>
    <div   id="Area_W" class="Auct_Area" ref="Y" arid="W">
        <div tabindex="0"  class="Loading"></div>
    </div>
    <div  class="Fadebar"></div>
        <div tabindex="0"  class="PageFrame" area="W">
            <span class="PageLeft"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle"  /></span>
            <span tabindex="0"class="PageText">page  <input id="curPWB" type="text" curPG=""/>  of <span id="maxWB"></span> </span>
            <span class="PageRight"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
        </div>
</div>

Solution

The website https://charlotte.realforeclose.com uses AJAX. You need to do some reverse engineering to find out how it works.

Open Chrome, press F12 to open Developer Tools or choose the option from the menu.

Open the Network tab, choose the XHR filter, paste the URL https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019 into the browser address bar, and press Enter. Watch the XHRs logged on the Network tab while the page loads. Inspect the XHRs with the larger response sizes first.

Click a request in the list to check its details: the request URL, headers, and parameters, as well as the response content.

Since the request method is GET, you can simply paste the URLs into the address bar and retrieve the content. The URLs for me are:

https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1&tx=1563171184890&bypassPage=1&test=1&_=1563171184890
https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=C&PageDir=0&doR=1&tx=1563171185129&bypassPage=0&test=1&_=1563171185129

After experimenting a bit, you can easily see that AREA=W selects the "Auctions Waiting" section and AREA=C selects the "Auctions Closed or Canceled" section. The parameters tx, bypassPage, test and _ do not seem to be necessary at all.
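As a rough illustration of that observation, the sketch below removes the four apparently optional parameters from a captured URL using only the Python standard library. The names `OPTIONAL_PARAMS` and `strip_optional_params` are made up for this example, and whether the server really ignores those parameters is the answer's observation, not something this code verifies.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that appeared unnecessary when the request was replayed by hand.
OPTIONAL_PARAMS = {"tx", "bypassPage", "test", "_"}


def strip_optional_params(url: str) -> str:
    """Return the URL with the apparently optional query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in OPTIONAL_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# One of the URLs captured in the Network tab.
captured = (
    "https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION"
    "&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1"
    "&tx=1563171184890&bypassPage=1&test=1&_=1563171184890"
)
minimal = strip_optional_params(captured)
print(minimal)
```

The remaining parameters keep their original order, so the result is easy to compare against what the browser sent.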

Open the first page with PageDir=0 and doR=1; after that, navigate to the next page with PageDir=1 and doR=0, and to the previous page with PageDir=-1 and doR=0.

The first page: https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1

And the next page: https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=1&doR=0
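The paging rule can be captured in a small helper. This is only a sketch: `BASE_URL` and `page_url` are hypothetical names, and the rule that doR is 1 only for the initial load is taken from the URLs observed above rather than from any documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://charlotte.realforeclose.com/index.cfm"


def page_url(area: str, page_dir: int) -> str:
    """Build the XHR URL for one section of the auction page.

    area: "W" (Auctions Waiting) or "C" (Auctions Closed or Canceled).
    page_dir: 0 loads the first page, 1 moves forward, -1 moves back.
    """
    if page_dir not in (-1, 0, 1):
        raise ValueError("page_dir must be -1, 0 or 1")
    params = {
        "zaction": "AUCTION",
        "Zmethod": "UPDATE",
        "FNC": "LOAD",
        "AREA": area,
        "PageDir": page_dir,
        # Observed pattern: doR=1 on the initial load, doR=0 when paging.
        "doR": 1 if page_dir == 0 else 0,
    }
    return BASE_URL + "?" + urlencode(params)


first_page = page_url("W", 0)
next_page = page_url("W", 1)
print(first_page)
print(next_page)
```

Note that PageDir is a direction, not a page number, so requests have to be issued in sequence to walk through the pages.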

Finally, you just need to reproduce those XHRs from your application and parse the responses. Depending on the HTTP client implementation, you may also need to add the necessary headers and cookie handling.
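One possible way to do this with the Python standard library is sketched below. Several things here are assumptions rather than facts from the site: the header values in `HEADERS`, the idea that visiting the preview page first sets the cookies the XHR needs, and the response format (the code tries JSON and falls back to plain text). All function names are made up for the example.

```python
import json
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

PREVIEW_URL = (
    "https://charlotte.realforeclose.com/index.cfm"
    "?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019"
)
XHR_URL = (
    "https://charlotte.realforeclose.com/index.cfm"
    "?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1"
)

# Assumed headers; copy the real ones from the Network tab if these fail.
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": PREVIEW_URL,
}


def make_opener():
    """Create an opener that stores cookies and resends them on later requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def decode_body(raw: bytes):
    """Return parsed JSON if the body is valid JSON, otherwise the decoded text."""
    text = raw.decode("utf-8", errors="replace")
    try:
        return json.loads(text)
    except ValueError:
        return text


def fetch(opener, url: str, timeout: float = 30.0):
    """GET a URL with the assumed headers and decode the response body."""
    request = Request(url, headers=HEADERS)
    with opener.open(request, timeout=timeout) as response:
        return decode_body(response.read())


def load_first_page():
    """Visit the preview page to pick up cookies, then replay the XHR."""
    opener = make_opener()
    fetch(opener, PREVIEW_URL)
    return fetch(opener, XHR_URL)
```

Calling `load_first_page()` performs live requests against the site, so it is left uncalled here. Whatever it returns still has to be parsed according to the response content you saw in the Network tab.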
