使用Excel VBA用XML HTTP请求来抓取网站:等待页面完全加载 [英] Scrape website with XML HTTP request with Excel VBA: wait for the page to fully load

查看:138
本文介绍了使用Excel VBA用XML HTTP请求来抓取网站:等待页面完全加载的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用Excel VBA从网页上抓取产品价格.使用VBA Internet Explorer导航请求时,以下代码有效.但是,我想使用XML HTTP请求来加快抓取过程.

在IE请求代码中,我告诉应用程序等待3秒,以使页面完全加载并能够抓取产品价格.如果不包括此行,则找不到价格.

我尝试使用XML HTTP请求(请参见第二个代码)更改此设置,但未成功.找不到价格输出.似乎代码试图在完全加载页面之前抓取该页面.

如何调整XML HTTP请求代码,以便它能找到产品价格(并且仅在页面(和脚本)完全加载时才开始搜索/抓取?

以下IE请求代码正在运行:(立即调试.打印产品价格)

  Sub Get_Product_Price_AH_IE()昏暗的IE作为新的SHDocVw.InternetExplorer将HTMLDoc变暗为MSHTML.HTMLDocument昏暗AHArticles作为MSHTML.IHTMLElementCollection昏暗的AHArticle为MSHTML.IHTMLElement昏暗的AHEuros作为MSHTML.IHTMLElementCollection昏暗的AHCents作为MSHTML.IHTMLElementCollection昏暗AHPrice欧元为双倍昏暗AHPriceCent为两倍暗淡AH价格为两倍IE.Visible = FalseIE.navigate"https://www.ah.nl/producten/product/wi3640/lu-bastogne-biscuits-original"IE.readyState<>READYSTATE_COMPLETE环形设置HTMLDoc = IE.document'等待页面完全加载以获取价格数据立即等待+#12:00:03 AM#设置AHArticles = HTMLDoc.getElementsByTagName("article")对于AHArticles中的每个AHArticle如果AHArticle.getAttribute("data-sku")="wi3640",则设置AHEuros = AHArticle.getElementsByClassName("price__integer")设置AHCents = AHArticle.getElementsByClassName("price__fractional")AHPriceEuro = AHEuros.Item(0).innerTextAHPriceCent = AHCents.Item(0).innerTextAH价格= AH价格欧元+(AHPriceCent/100)Debug.Print AHPrice退出万一下一篇AHArticle退出浏览器结束子 

以下XML HTTP请求未提供所需的输出(即时调试屏幕中未显示价格):

  Sub Get_Product_Price_AH_XML()昏暗的XMLReq作为新的MSXML2.XMLHTTP60将HTMLDoc设为新的MSHTML.HTMLDocument昏暗的AHArticles作为MSHTML.IHTMLElementCollection昏暗的AHArticle为MSHTML.IHTMLElement昏暗的AHEuros作为MSHTML.IHTMLElementCollection昏暗的AHCents作为MSHTML.IHTMLElementCollection昏暗AHPrice欧元为双倍昏暗AHPriceCent为两倍暗淡AH价格为两倍XMLReq.Open"GET","https://www.ah.nl/producten/product/wi3640/lu-bastogne-biscuits-original",FalseXML请求发送如果XMLReq.Status<>200然后MsgBox问题"和vbNewLine&XML请求状态-"&XMLReq.statusText退出子万一HTMLDoc.body.innerHTML = XMLReq.responseText立即等待+#12:00:03 AM#设置AHArticles = HTMLDoc.getElementsByTagName("article")对于AHArticles中的每个AHArticle如果AHArticle.getAttribute("data-sku")="wi3640",则设置AHEuros = AHArticle.getElementsByClassName("price__integer")设置AHCents = AHArticle.getElementsByClassName("price__fractional")AHPriceEuro = AHEuros.Item(0).innerTextAHPriceCent = AHCents.Item(0).innerTextAH价格= AH价格欧元+(AHPriceCent/100)Debug.Print AHPrice退出万一下一篇AHArticle结束子 

解决方案

REST API HTTP请求:

您所注意到的,您当前的方法不允许页面完全加载.您可以使用URLEncode将编码的URL字符串传递给REST API

您可以看到的目标价格在位置1的字符串中.

因此,该字符串的提取方式为:

  Split(sResponse,""now":")(1) 

然后我们只需要获取价格,因此请使用分隔符}" 再次使用 Split 来获取 1.55 :

  Split(Split(sResponse,""now":")(1),}") 

这将导致以下数组(缩短为相当长的时间):

我们想要的价格现在在新数组中的位置0,这就是为什么我们可以使用以下内容提取响应的原因.

  price = Split(Split(sResponse,""now":")(1),}")(0) 

使用JSON解析器:

如果要遍历json结构,可以使用以下内容:

  Dim json作为对象设置json = JsonConverter.ParseJson(sResponse)("_ embedded")("lanes")(5)("_ embedded")("items")(1)("_ embedded")("product")("priceLabel")Debug.Print json("now") 

下载并添加 JSONConverter.bas 后,您可以通过 VBE>添加对 Microsoft Scripting Runtime 的引用.工具>参考.这上面的 Set json 代码语句代表价格的路径,如下面的JSON结构所示.我折叠了一些细节以使路径更清晰.您可以将上述几行代码插入原始代码中,以代替 Split 行.

在上图中的 [] 表示. {} 表示.我这行的语法

  Set json = JsonConverter.ParseJson(sResponse)("_ embedded")("lanes")(5)("_ embedded")("items")(1)("_ embedded")("product")("priceLabel") 

展示了导航这两种对象类型的不同语法.

I'm trying to scrape a product price from a webpage using Excel VBA. The following code is working when using VBA Internet Explorer navigate request. However I would like to use an XML HTTP request instead to speed up the scraping process.

In the IE request code I tell the application to wait for 3 seconds to have the page fully load and be able to scrape the product price. If this line is not included it won't find the price.

I tried to change this with an XML HTTP request (see the second code) but without success. No price output was found. It seems that the code tries to scrape the page before it has been fully loaded.

How can I adjust the XML HTTP request code so that it will find the product price (and only start searching/scraping when the page (and scripts) are fully loaded?

The following IE request code is working: (immediate debug.prints a price of the product)

Sub Get_Product_Price_AH_IE()

Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument

Dim AHArticles As MSHTML.IHTMLElementCollection
Dim AHArticle As MSHTML.IHTMLElement

Dim AHEuros As MSHTML.IHTMLElementCollection
Dim AHCents As MSHTML.IHTMLElementCollection

Dim AHPriceEuro As Double
Dim AHPriceCent As Double
Dim AHPrice As Double


IE.Visible = False
IE.navigate "https://www.ah.nl/producten/product/wi3640/lu-bastogne-biscuits-original"


    Do While IE.readyState <> READYSTATE_COMPLETE
    Loop

    Set HTMLDoc = IE.document

'wait for the page to fully load to be able to get price data
Application.Wait Now + #12:00:03 AM#


Set AHArticles = HTMLDoc.getElementsByTagName("article")

For Each AHArticle In AHArticles

 If AHArticle.getAttribute("data-sku") = "wi3640" Then

        Set AHEuros = AHArticle.getElementsByClassName("price__integer")
        Set AHCents = AHArticle.getElementsByClassName("price__fractional")

       AHPriceEuro = AHEuros.Item(0).innerText
       AHPriceCent = AHCents.Item(0).innerText

      AHPrice = AHPriceEuro + (AHPriceCent / 100)

Debug.Print AHPrice

            Exit For
        End If


Next AHArticle

IE.Quit

End Sub

The following XML HTTP request is not giving the desired output (no price is printed in the immediate debug screen):

Sub Get_Product_Price_AH_XML()

Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument

Dim AHArticles As MSHTML.IHTMLElementCollection
Dim AHArticle As MSHTML.IHTMLElement

Dim AHEuros As MSHTML.IHTMLElementCollection
Dim AHCents As MSHTML.IHTMLElementCollection

Dim AHPriceEuro As Double
Dim AHPriceCent As Double
Dim AHPrice As Double


XMLReq.Open "GET", "https://www.ah.nl/producten/product/wi3640/lu-bastogne-biscuits-original", False
XMLReq.send


If XMLReq.Status <> 200 Then
    MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
    Exit Sub
    End If

HTMLDoc.body.innerHTML = XMLReq.responseText


Application.Wait Now + #12:00:03 AM#


Set AHArticles = HTMLDoc.getElementsByTagName("article")

For Each AHArticle In AHArticles

 If AHArticle.getAttribute("data-sku") = "wi3640" Then

        Set AHEuros = AHArticle.getElementsByClassName("price__integer")
        Set AHCents = AHArticle.getElementsByClassName("price__fractional")

       AHPriceEuro = AHEuros.Item(0).innerText
       AHPriceCent = AHCents.Item(0).innerText

      AHPrice = AHPriceEuro + (AHPriceCent / 100)

Debug.Print AHPrice

            Exit For
        End If


Next AHArticle


End Sub

解决方案

REST API HTTP Request:

Your current method does not allow for the page to load fully as you have noted. You can formulate a REST API XMLHTTPrequest, using URLEncode to pass an encoded URL string to the API. The server sends back a JSON response containing the value you are after and lots of other info as well.

I demonstrate two methods of extracting the price info from the returned JSON string: ① Using the Split function to extract the price by generating substrings until the required string is left; ② Using a JSONParser to navigate the JSON structure and return the required value.

Code:

The following uses Split to extract the value.

Option Explicit
Public Sub GetPrice()
    Const BASE_URL As String = "https://www.ah.nl/service/rest/delegate?url="
    Dim URL As String, sResponse As String, price As String
    URL = BASE_URL & Application.WorksheetFunction.EncodeURL("/producten/product/wi3640/lu-bastogne-biscuits-original")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    price = Split(Split(sResponse, """now"":")(1), "}")(0)
    Debug.Print price
End Sub


Parsing the JSON response:

Using Split:

You could read the whole JSON response into a JSON object using a JSON parser, for example JSONConverter.bas. Then parse that object for price. I found it simpler to use Split function to extract the required info shown below:

Split returns a zero-based, one-dimensional array containing a specified number of substrings based on splitting the input string on a specified delimiter.

In the line,

price = Split(Split(sResponse, """now"":")(1), "}")(0)

I have two nested Split statements. These consecutively split the response JSON string to extract the price 1.55.

The first split is using "now": as the delimiter resulting in an array as follows:

The target price you can see is in the string at position 1.

So, that string is extracted with:

Split(sResponse, """now"":")(1)

We then need to get just the price so use Split again to grab the 1.55 by using the delimiter "}":

Split(Split(sResponse, """now"":")(1), "}")

This results in the following array (shortened as quite long):

The price we want is now at position 0 in the new array which is why we can use the following to extract the response.

price = Split(Split(sResponse, """now"":")(1), "}")(0)

Using JSON parser:

If you want to traverse the json structure you would use the following:

Dim json As Object
Set json = JsonConverter.ParseJson(sResponse)("_embedded")("lanes")(5)("_embedded")("items")(1)("_embedded")("product")("priceLabel")
Debug.Print json("now")

After downloading and adding the JSONConverter.bas, you then add a reference to Microsoft Scripting Runtime via VBE > Tools > References. The above Set json code statement represents the path to the price, as seen in the JSON structure below. I have collapsed some detail to make the path clearer. You would insert the above couple of lines, into the original code, in place of the Split line.

In the diagram above [] denotes a collection object which needs to be accessed via index, e.g. JsonConverter.ParseJson(sResponse)("_embedded")("lanes")(5). The {} denotes a dictionary object which can be accessed by key e.g. JsonConverter.ParseJson(sResponse)("_embedded")("lanes")(5)("_embedded"). The syntax in my line,

Set json = JsonConverter.ParseJson(sResponse)("_embedded")("lanes")(5)("_embedded")("items")(1)("_embedded")("product")("priceLabel")

demonstrates the different syntax to navigate these two object types.

这篇关于使用Excel VBA用XML HTTP请求来抓取网站:等待页面完全加载的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆