XML web scraping a website with dynamic key

Question

I have been scraping this site from Excel using IE, but lately IE has been inconsistent and slow. My list usually has around 500 to 1000 entries, so I have to run the macro overnight. Recently the macro started to hang. That is why I decided to explore scraping with MSXML2 for the first time.

The site needs no authentication, but it has hidden inputs whose values change dynamically.

What I have done so far: I used GET to pull the page and extracted the dynamic key, then tried to use POST to send the input data to the site. I keep getting a server error/run-time error. I have tried different request headers, but I am still not getting the result page. I have also tried MSXML2.ServerXMLHTTP. Am I on the right track?

Sub test_66()

  Dim oXML_get
  'Dim oXML_post
  Dim sendText As String, s2 As String, n1 As Integer, postUrl As String,      sHTML As String, s1 As String

  ' Instantiate MSXML2
  Set oXML_get = New MSXML2.XMLHTTP
  oXML_get.Open "GET", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  oXML_get.setRequestHeader "Connection", "keep -alive"
  oXML_get.send
  sHTML = oXML_get.responseText
  'Debug.Print sHTML
  Dim hDOC As MSHTML.HTMLDocument
  Set hDOC = New MSHTML.HTMLDocument
  hDOC.body.innerHTML = sHTML
  s1 = Replace(hDOC.getElementsByTagName("input").Item(2).Value, "/", "%2F")
  s2 = Replace(hDOC.getElementsByTagName("input").Item(3).Value, "/", "%2F")
  sendText = "__VIEWSTATE=" & s1 & "&__EVENTVALIDATION=" & s2 & "&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=043185500&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=%20>>"
  Debug.Print sendText '"__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=" & s1 & "__EVENTVALIDATION=" & s2 & 
  oXML_get.Open "POST", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
  oXML_get.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
  oXML_get.setRequestHeader "Accept", "text/html;charset=UTF-8"
  oXML_get.setRequestHeader "Accept-Encoding", "identity"
  oXML_get.setRequestHeader "Accept-Charset", "UTF-8" 'Connection keep -alive
  'oXML_get.setRequestHeader "Connection", "keep -alive"
  oXML_get.send (sendText)
  Dim objIE As Object: Set objIE = CreateObject("InternetExplorer.Application")
  objIE.navigate "about:blank"
  objIE.Visible = True
  objIE.document.Write oXML_get.responseText

End Sub

This is the runtime error message that I am getting:

Server Error in '/revenue/RealEstateTax' Application.
<!-- Web.Config Configuration File -->

<configuration>
    <system.web>
        <customErrors mode="Off"/>
    </system.web>
</configuration>

Recommended answer

I submitted the same search request from the web form on the page in Firefox. Then I opened Developer Tools (F12), went to the Network tab, clicked the last POST request, and opened the Parameters section, which shows the parameters that were submitted.

Raw form data:

__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwULLTEyNDQ4MDU4OTkPZBYCZg9kFgICAw9kFgICDQ9kFgYCAQ9kFgICAw9kFgICAQ8QZBAVARUxNzAwIFNQUklORyBHQVJERU4gU1QVARUxNzAwIFNQUklORyBHQVJERU4gU1QUKwMBZxYBZmQCBQ8PFgIeBFRleHQFHFBsZWFzZSBhZGQgYWRkcmVzcyB0byBsb29rdXBkZAINDw8WAh4HVmlzaWJsZWhkFgoCAQ88KwAKAQAPFgQeC18hRGF0YUJvdW5kZx4LXyFJdGVtQ291bnRmZGQCAw9kFgICBQ8PFgIeF0VuYWJsZUFqYXhTa2luUmVuZGVyaW5naGRkAgUPFCsAAg8WAh8EaGQQFgJmAgEWAg8WBB4LTmF2aWdhdGVVcmwFJC4uL0ZlZWRiYWNrRm9ybS5hc3B4P0JydE5vPTc3MjUzNDcwMB8EaGQPFgQfBQUdfi9QREZzL1BheW1lbnRfQWdyZWVtZW50cy5wZGYfBGhkDxYCZmYWAQVxVGVsZXJpay5XZWIuVUkuUmFkV2luZG93LCBUZWxlcmlrLldlYi5VSSwgVmVyc2lvbj0yMDEwLjEuNTE5LjQwLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPTEyMWZhZTc4MTY1YmEzZDQWBGYPDxYEHwUFJC4uL0ZlZWRiYWNrRm9ybS5hc3B4P0JydE5vPTc3MjUzNDcwMB8EaGRkAgEPDxYEHwUFHX4vUERGcy9QYXltZW50X0FncmVlbWVudHMucGRmHwRoZGQCBw88KwARAgAPFgQfAmcfA2ZkARAWABYAFgBkAgkPFgIeBXZhbHVlBQk3NzI1MzQ3MDBkGAIFQWN0bDAwJEJvZHlDb250ZW50UGxhY2VIb2xkZXIkR2V0VGF4SW5mb0NvbnRyb2wkZ3JkUGF5bWVudHNIaXN0b3J5DzwrAAwBCGZkBTJjdGwwMCRCb2R5Q29udGVudFBsYWNlSG9sZGVyJEdldFRheEluZm9Db250cm9sJGZybQ9nZD9K5t7genscvOsiNrdPkxL0VHWCYSsS%2FK3EZTRu3h3w&__EVENTVALIDATION=%2FwEWBQKkrNCPCgLRzsWTBwLlpIbACAKV6q2KDQKIvdHyCawQaHbBYSHV%2B%2FVvyLUTUY%2BhSsmbpTvj0W4ycfOa1RCO&ctl00%24BodyContentPlaceHolder%24SearchByAddressControl%24txtLookup=by+Property+Address&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=043185500&ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=+%3E%3E
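
To check what the browser actually sent, a captured body like the one above can be split into its name/value pairs. The following is a minimal VBA sketch and not part of the original answer; the names `ListFormFields` and `DemoListFormFields` are hypothetical, and the sample body is shortened for illustration.

```vba
Option Explicit

' Hypothetical helper: prints each field name of a URL-encoded form body
' to the Immediate window and returns the number of non-empty pairs.
Function ListFormFields(ByVal sBody As String) As Long
    Dim aPairs As Variant
    Dim aParts As Variant
    Dim i As Long
    Dim lCount As Long

    If Len(sBody) = 0 Then Exit Function        ' nothing to list, returns 0

    aPairs = Split(sBody, "&")
    For i = LBound(aPairs) To UBound(aPairs)
        If Len(aPairs(i)) > 0 Then              ' skip empty pieces such as a trailing "&"
            ' Limit 2 keeps any further "=" inside the value intact
            aParts = Split(aPairs(i), "=", 2)
            lCount = lCount + 1
            Debug.Print lCount & ": " & aParts(0)
        End If
    Next i
    ListFormFields = lCount
End Function

Sub DemoListFormFields()
    ' Shortened sample body; the real __VIEWSTATE value is much longer
    Debug.Print ListFormFields("__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEP")
End Sub
```

Running the same helper on the string built by the macro makes it easy to compare its field list with the one the browser sends.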

Note that there are 7 parameters, and all of them should be URL-encoded. I have slightly reworked your code and added some request headers. The following code works correctly for me:

Option Explicit

Sub test_66()

    Dim s1 As String
    Dim s2 As String
    Dim sResp As String
    Dim aTmp As Variant
    Dim sBRTNumber As String
    Dim sFormData As String

    ' Step 1: GET the page to obtain the current hidden field values
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
        .setRequestHeader "Accept", "text/html;charset=UTF-8"
        .setRequestHeader "Accept-Encoding", "identity"
        .setRequestHeader "Accept-Charset", "UTF-8"
        .setRequestHeader "Connection", "keep-alive"
        .send
        sResp = .responseText
    End With
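    ' Cut out the __VIEWSTATE value: the text between
    ' id="__VIEWSTATE" value=" and the next double quote
    aTmp = Split(sResp, "id=""__VIEWSTATE"" value=""", 2)
    s1 = aTmp(1)
    aTmp = Split(s1, """", 2)
    s1 = aTmp(0)
    ' Same for the __EVENTVALIDATION value
    aTmp = Split(sResp, "id=""__EVENTVALIDATION"" value=""", 2)
    s2 = aTmp(1)
    aTmp = Split(s2, """", 2)
    s2 = aTmp(0)
    ' URL-encode both values (they contain characters such as "/" and "+")
    s1 = EncodeUriComponent(s1)
    s2 = EncodeUriComponent(s2)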

    sBRTNumber = "043185500"
    sFormData = Join(Array( _
        "__EVENTTARGET=", _
        "__EVENTARGUMENT=", _
        "__VIEWSTATE=" & s1, _
        "__EVENTVALIDATION=" & s2, _
        "ctl00%24BodyContentPlaceHolder%24SearchByAddressControl%24txtLookup=by+Property+Address", _
        "ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24txtTaxInfo=" & sBRTNumber, _
        "ctl00%24BodyContentPlaceHolder%24SearchByBRTControl%24btnTaxByBRT=+%3E%3E" _
        ), "&")

    ' Step 2: POST the complete form data back to the same URL
    With CreateObject("MSXML2.XMLHTTP")
        .Open "POST", "http://www.phila.gov/revenue/realestatetax/default.aspx", False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .setRequestHeader "Accept", "text/html;charset=UTF-8"
        .setRequestHeader "Accept-Encoding", "identity"
        .setRequestHeader "Accept-Charset", "UTF-8"
        .setRequestHeader "Connection", "keep-alive"
        .setRequestHeader "Host", "www.phila.gov"
        .setRequestHeader "Origin", "http://www.phila.gov"
        .setRequestHeader "Referer", "http://www.phila.gov/revenue/realestatetax/default.aspx"
        .send (sFormData)
        sResp = .responseText
    End With

    With CreateObject("InternetExplorer.Application")
        .navigate "about:blank"
        .Visible = True
        .document.write sResp
    End With

End Sub

Function EncodeUriComponent(strText As String) As String
    Static objHtmlfile As Object
    If objHtmlfile Is Nothing Then
        Set objHtmlfile = CreateObject("htmlfile")
        objHtmlfile.parentWindow.execScript "function encode(s) {return encodeURIComponent(s)}", "jscript"
    End If
    EncodeUriComponent = objHtmlfile.parentWindow.encode(strText)
End Function
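
One thing to be aware of when running this over a long list: if the page ever comes back without the expected markup (an error page, for instance), the `Split` calls in `test_66` return a one-element array and `aTmp(1)` raises run-time error 9, "Subscript out of range". A more defensive extraction might look like the sketch below. `GetHiddenFieldValue` is a hypothetical helper name, the sample markup is made up, and the helper assumes the same `id="..." value="..."` attribute order that the code above relies on.

```vba
Option Explicit

' Hypothetical helper: returns the value of <input ... id="sId" value="...">
' or an empty string when the marker is not found.
' Assumes the id attribute is immediately followed by the value attribute.
Function GetHiddenFieldValue(ByVal sHtml As String, ByVal sId As String) As String
    Dim sMarker As String
    Dim lStart As Long
    Dim lEnd As Long

    sMarker = "id=""" & sId & """ value="""
    lStart = InStr(1, sHtml, sMarker, vbBinaryCompare)
    If lStart = 0 Then Exit Function            ' marker missing: return ""

    lStart = lStart + Len(sMarker)              ' first character of the value
    lEnd = InStr(lStart, sHtml, """", vbBinaryCompare)
    If lEnd = 0 Then Exit Function              ' unterminated attribute: return ""

    GetHiddenFieldValue = Mid$(sHtml, lStart, lEnd - lStart)
End Function

Sub DemoGetHiddenFieldValue()
    Dim sSample As String
    ' Made-up sample markup, not the real page
    sSample = "<input type=""hidden"" name=""__VIEWSTATE"" id=""__VIEWSTATE"" value=""/wEPDw=="" />"
    Debug.Print GetHiddenFieldValue(sSample, "__VIEWSTATE")
    Debug.Print Len(GetHiddenFieldValue(sSample, "__EVENTVALIDATION"))
End Sub
```

In `test_66` this could stand in for the two `Split` blocks, for example `s1 = GetHiddenFieldValue(sResp, "__VIEWSTATE")`, followed by a check that the result is not empty before the POST is sent.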

The result page is then displayed in the IE window.
