Dual class in HTML scraping with VBA
Problem description
I am trying to extract prices from this HTML page using the VBA code below:
Here's the HTML snippet:
<div class="box-text box-text-products">
<div class="title-wrapper">
<p class="category uppercase is-smaller no-text-overflow product-cat op-7">
Xikar Lighters
</p>
<p class="name product-title">
<a href="https://www.havanahouse.co.uk/product/xikar-allume-single-jet-flame-racing-cigar-lighter-bluewhite-stripe/">Xikar Allume Single Jet Flame Racing Cigar Lighter – Blue/White Stripe</a>
</p>
</div>
<div class="price-wrapper">
<span class="price">
<del>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">£</span>48.00
</span>
</del>
<ins>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">£</span>45.00
</span>
</ins>
</span>
</div>
</div>
<!-- box-text --></div><!-- box --></div><!-- .col-inner --></div><!-- col -->
I am using the below code but I get an error:
For Each oElement In oHtml.getElementsByClassName("woocommerce-Price-amount amount")
    If oElement.getElementsByTagName("del") Then Exit For
    If oElement.innerText <> 0 Then
        Cells(counter, 3) = CDbl(oElement.innerText)
        counter = counter + 1
    End If
Next oElement
Solution
Take a look at the below example:
Option Explicit

Sub Test()

    Dim sUrl As String
    Dim oWS As Worksheet
    Dim i As Long
    Dim sResp As String
    Dim sCont As String
    Dim oMatch

    sUrl = "https://www.havanahouse.co.uk/?post_type=product"
    Set oWS = ThisWorkbook.Sheets(1)
    oWS.Cells.Delete
    i = 1
    Do
        ' Download the current page synchronously
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", sUrl, False
            .send
            sResp = .ResponseText
        End With
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            ' Cut out the block that holds the product list
            .Pattern = "<div class=""shop-container"">([\s\S]*?)<div class=""container"">"
            With .Execute(sResp)
                If .Count = 0 Then Exit Do
                sCont = .Item(0).SubMatches(0)
            End With
            ' Capture the title and price fragments of each product
            .Pattern = "<div class=""title-wrapper"">([\s\S]*?)</div><div class=""price-wrapper"">([\s\S]*?)</div>"
            For Each oMatch In .Execute(sCont)
                oWS.Cells(i, 1) = GetInnerText(oMatch.SubMatches(0))
                oWS.Cells(i, 2) = GetInnerText(oMatch.SubMatches(1))
                oWS.Columns.AutoFit
                i = i + 1
                DoEvents
            Next
            oWS.Cells(i, 1).Select
            ' Follow the "next page" link; stop when there is none
            .Pattern = "<a class=""next page-number""[\s\S]*?href=""([^""]*)"""
            With .Execute(sResp)
                If .Count = 0 Then Exit Do
                sUrl = .Item(0).SubMatches(0)
            End With
        End With
    Loop

End Sub

Function GetInnerText(sText As String) As String

    ' The document and the helper <div> are created once and reused between calls
    Static oHtmlfile As Object
    Static oDiv As Object

    If oHtmlfile Is Nothing Then
        Set oHtmlfile = CreateObject("htmlfile")
        oHtmlfile.Open
        Set oDiv = oHtmlfile.createElement("div")
    End If
    ' Let the HTML parser strip the tags and decode the entities
    oDiv.innerHTML = sText
    GetInnerText = oDiv.innerText

End Function
The output for me is as follows:
Generally, regular expressions aren't recommended for HTML parsing, so treat this as a disclaimer. The data being processed in this case is quite simple, which is why it is parsed with RegEx. About RegEx: introduction (especially syntax), introduction JS, VB flavor.
BTW, there are other answers using a similar approach: 1, 2, 3, 4, 5.
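For comparison, here is a minimal sketch of a DOM-based way to read only the current price, assuming the markup looks like the snippet in the question. `GetCurrentPrices` and `DemoPrices` are hypothetical names that are not part of the code above. The sketch treats `class="woocommerce-Price-amount amount"` as two separate class names, skips any amount whose parent is `<del>`, and strips the pound sign before converting, since `CDbl` on text such as "£45.00" may fail depending on the regional settings.

```vba
Option Explicit

' Hypothetical helper (not from the original answer): collects the prices
' in sHtml that are not wrapped in a <del> element.
Function GetCurrentPrices(sHtml As String) As Collection
    Dim oHtmlfile As Object
    Dim oDiv As Object
    Dim oSpan As Object
    Dim sText As String
    Dim oPrices As Collection

    Set oPrices = New Collection
    ' Same parsing trick as GetInnerText: a detached <div> in an htmlfile document
    Set oHtmlfile = CreateObject("htmlfile")
    oHtmlfile.Open
    Set oDiv = oHtmlfile.createElement("div")
    oDiv.innerHTML = sHtml

    For Each oSpan In oDiv.getElementsByTagName("span")
        ' The class attribute holds two space-separated class names;
        ' padding with spaces lets Like match the whole token "amount".
        If " " & oSpan.className & " " Like "* amount *" Then
            ' Skip the old price, whose parent element is <del>.
            If UCase$(oSpan.parentNode.tagName) <> "DEL" Then
                ' Drop the pound sign (ChrW(163)); Val always reads
                ' the dot as the decimal separator.
                sText = Replace(oSpan.innerText, ChrW(163), "")
                oPrices.Add Val(sText)
            End If
        End If
    Next oSpan

    Set GetCurrentPrices = oPrices
End Function

' Assumed usage: a cut-down copy of the price markup from the question.
Sub DemoPrices()
    Dim sHtml As String
    Dim vPrice As Variant

    sHtml = "<span class=""price"">" & _
            "<del><span class=""woocommerce-Price-amount amount"">" & _
            "<span class=""woocommerce-Price-currencySymbol"">&pound;</span>48.00</span></del>" & _
            "<ins><span class=""woocommerce-Price-amount amount"">" & _
            "<span class=""woocommerce-Price-currencySymbol"">&pound;</span>45.00</span></ins>" & _
            "</span>"

    ' Lists the amounts found outside <del> in the Immediate window
    For Each vPrice In GetCurrentPrices(sHtml)
        Debug.Print vPrice
    Next vPrice
End Sub
```

Filtering on the tag name and `className` avoids relying on `getElementsByClassName`, which may not be available on a late-bound `htmlfile` document depending on its document mode.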