使用组合框VBA中的选项来抓取网站数据 [英] Scraping website data with options in combo box VBA
问题描述
我想从以下网站上刮取产品名称(Cohiba Robusto),产品尺寸(单支雪茄,3包,25盒)和价格(33.65英镑,90英镑,730英镑):
我正在使用下面的代码,该代码给出错误(对象变量或未设置变量").
将对此表示感谢.
Sub getproducts()表格("JJFox").选择作为HTMLDocument的Dim oHtml昏暗的oElement作为对象暗元素作为IHTMLElementCollection将Dim文档另存为HTMLDocument设置oHtml = New HTMLDocument'单元格(1,6)=时间()lastrow = ActiveSheet.Cells(Rows.Count,"A").End(xlUp).Rowcnt = lastrow + 1counter1 = cnt昏暗的gg作为字符串gg ="https://www.jjfox.co.uk/cohiba-robusto-621.html"昏暗的objHTTP作为新的WinHttp.WinHttpRequest网址= ggobjHTTP.Open"POST",网址,FalseobjHTTP.setRequestHeader"Content-Type","application/json"objHTTP.send("{""key":null,"来自":" me@me.com",到":null,"cc","null","bcc","null","date","null","subject","my Subject"","body"":null,"attachments":null}")oHtml.body.innerHTML = objHTTP.responseText'Cells(rw,2)= oHtml.getElementsByTagName("description").innerText'如果不是.Document.querySelector("button [aria-label ='Close']")则什么也没有'.Document.querySelector("button [aria-label ='Close']")).点击' 万一txttitle = oHtml.getElementsByClassName("productcart")(0).innerTexttxttitlehtml = oHtml.getElementsByClassName("packsize")(0).innerHTMLtxttitle = Mid(txttitle,1,InStr(1,txttitle,Chr(10)))'Debug.Print txttitlehtml'txttitle2 = oHtml.getElementsByClassName(价格")(0).innerText昏暗的文字作为字符串文字= GetHTML(gg)开始= InStr(1,文本,"spConfig =")endS = InStr(开始+ 1,文本,"spConfig")如果开始= 0那么单元格(counter1,1)= txttitle单元格(counter1,2)=单"开始= InStr(开始+ 1,文本,"productPrice")endl = InStr(开始+ 1,文本,,")像元(counter1,3)= Val(Mid(文本,开始+ 14,endl-(开始+ 14)))单元格(counter1,4)="JJFox"单元格(counter1,5)= Now()单元格(counter1,7)= gg'链接到页面计数器1 =计数器1 + 1别的文字=中(文字,开始,结束S-开始)'Debug.Print文本'找到多少个包装选项可用myTxt =文字countTxt ="label"bb =(Len(myTxt)-Len(替换(myTxt,countTxt,"))))/Len(countTxt)-1'结束查找/////////////////////////////////////varlabel ="class ="&Chr(34)&标签"&铬(34)开始= InStr(1,文本,标签")+ 1文字=中(文字,开头,Len(文字))对于i = 1到bb开始= InStr(1,文本,标签")如果InStr(开始,文本,标签"),则'显示元素的属性单元格(counter1,1)= txttitle单元格(counter1,2)= Mid(文本,开始+ 8,InStr(开始,文本,"\")-(开始+ 8))开始= InStr(开始+ 1,文本,"oldPrice")endl = InStr(开始+ 1,文本,,")Cells(counter1,3).FormulaR1C1 = Val(Mid(Text,starts + 11,endl-(starts + 11))))'Debug.Print Val(Mid(Text,startS + chrs,6))单元格(counter1,4)="JJFox"单元格(counter1,5)= Now()开始=开始+ 1文字=中(文字,开头,Len(文字))单元格(counter1,7)= gg'链接到页面计数器1 =计数器1 + 1万一接下来我万一'Cells(2,6)= Time()结束子函数GetHTML(URL作为字符串)作为字符串使用CreateObject("MSXML2.ServerXMLHTTP.6.0").打开"GET",URL,False.发送GetHTML = .responseText结束于结束功能
价格和标签是从脚本标签动态提取的,您可以使用json解析器将其内容解析为json.但是,您需要从html中获取名称.
对html和css有一点了解,就可以很容易地定义一个css模式来针对目标脚本节点:
.fieldset [type ='text/x-magento-init']
查找具有属性 text/x-magento-init
的 type
属性的子 script
和父类 fieldset
.
我使用的效率稍低(您不会注意到):
对于i = 1到optionsCollection.Count
仅仅是因为我知道集合很小,并且允许我通过一个循环将其索引为两个变量.
Json库:
我使用jsonconverter.bas.从此处下载原始代码,并将其添加到标准模块中称为JsonConverter.从复制的代码中删除最上面的属性行.
然后您需要去
VBE>工具>参考书目>添加对以下内容的引用:
Microsoft脚本运行时Microsoft HTML对象库Microsoft XML库.
在VBA for json中,[]表示集合,{}表示字典.
选项显式公共子GetCigarData()'<VBE>工具>参考:'Microsoft脚本运行时'Microsoft HTML对象库'Microsoft XML库Dim json作为对象,html作为MSHTML.HTMLDocument,xhr作为MSXML2.XMLHTTP60,ws作为工作表设置ws = ThisWorkbook.Worksheets("Sheet1")设置xhr = New MSXML2.XMLHTTP60设置html =新的MSHTML.HTMLDocument与xhr.打开"GET","https://www.jjfox.co.uk/cohiba-robusto-621.html",False.setRequestHeader用户代理","Mozilla/5.0".发送html.body.innerHTML = .responseText结束于设置json = jsonConverter.ParseJson(html.querySelector(.fieldset [type ='text/x-magento-init']").innerHTML)(#product_addtocart_form")(可配置")("spConfig")暗淡的价格作为Scripting.Dictionary,选项作为Scripting.Dictionary,options作为集合设置价格= json("optionPrices")设置选项= json(属性")设置optionsCollection = options(options.Keys(0))("options")Dim results()作为变量,headers()作为变量,i作为long,名称作为StringReDim结果(1到optionsCollection.Count,1到3)名称= html.querySelector(.base").innerText对于i = 1 To optionsCollection.Count结果(i,1)=名称results(i,2)= optionsCollection.item(i)(标签")结果(i,3)=价格(prices.Keys(i-1))("finalPrice")(金额")下一个标头= Array(名称",大小",价格")与ws.Cells(1,1).Resize(1,UBound(headers)+1)=标头.Cells(2,1).Resize(UBound(results,1),UBound(results,2))=结果结束于结束子
了解有关CSS选择器的信息:
Hi I am trying to scrape the Product name (Cohiba Robusto), Product Size (Single Cigar, Pack of 3, Box of 25) and prices (£33.65, £90, £730) from this website: https://www.jjfox.co.uk/cohiba-robusto-621.html
I am trying to get something like this:
I am using the code below, which gives an error ("Object variable or with variable not set").
Will appreciate any help with this.
Sub getproducts()
Sheets("JJFox").Select
Dim oHtml As HTMLDocument
Dim oElement As Object
Dim Elements As IHTMLElementCollection
Dim Document As HTMLDocument
Set oHtml = New HTMLDocument
'Cells(1, 6) = Time()
lastrow = ActiveSheet.Cells(Rows.Count, "A").End(xlUp).Row
cnt = lastrow + 1
counter1 = cnt
Dim gg As String
gg = "https://www.jjfox.co.uk/cohiba-robusto-621.html"
Dim objHTTP As New WinHttp.WinHttpRequest
url = gg
objHTTP.Open "POST", url, False
objHTTP.setRequestHeader "Content-Type", "application/json"
objHTTP.send ("{""key"":null,""from"":""me@me.com"",""to"":null,""cc"":null,""bcc"":null,""date"":null,""subject"":""My Subject"",""body"":null,""attachments"":null}")
oHtml.body.innerHTML = objHTTP.responseText
'Cells(rw, 2) = oHtml.getElementsByTagName("description").innerText
' If Not .Document.querySelector("button[aria-label='Close']") Is Nothing Then
' .Document.querySelector("button[aria-label='Close']").Click
' End If
txttitle = oHtml.getElementsByClassName("productcart")(0).innerText
txttitlehtml = oHtml.getElementsByClassName("packsize")(0).innerHTML
txttitle = Mid(txttitle, 1, InStr(1, txttitle, Chr(10)))
'Debug.Print txttitlehtml
'txttitle2 = oHtml.getElementsByClassName("price")(0).innerText
Dim Text As String
Text = GetHTML(gg)
starts = InStr(1, Text, "spConfig =")
endS = InStr(starts + 1, Text, "spConfig")
If starts = 0 Then
Cells(counter1, 1) = txttitle
Cells(counter1, 2) = "Single"
starts = InStr(starts + 1, Text, "productPrice")
endl = InStr(starts + 1, Text, ",")
Cells(counter1, 3) = Val(Mid(Text, starts + 14, endl - (starts + 14)))
Cells(counter1, 4) = "JJFox"
Cells(counter1, 5) = Now()
Cells(counter1, 7) = gg ' link to the page
counter1 = counter1 + 1
Else
Text = Mid(Text, starts, endS - starts)
'Debug.Print Text
'find how many pack options are avaialble
myTxt = Text
countTxt = "label"
bb = (Len(myTxt) - Len(replace(myTxt, countTxt, ""))) / Len(countTxt) - 1
'End find////////////////////////////////////
varlabel = "class=" & Chr(34) & "label" & Chr(34)
starts = InStr(1, Text, "label") + 1
Text = Mid(Text, starts, Len(Text))
For i = 1 To bb
starts = InStr(1, Text, "label")
If InStr(starts, Text, "label") Then
'Show the element's properties
Cells(counter1, 1) = txttitle
Cells(counter1, 2) = Mid(Text, starts + 8, InStr(starts, Text, " \") - (starts + 8))
starts = InStr(starts + 1, Text, "oldPrice")
endl = InStr(starts + 1, Text, ",")
Cells(counter1, 3).FormulaR1C1 = Val(Mid(Text, starts + 11, endl - (starts + 11)))
'Debug.Print Val(Mid(Text, startS + chrs, 6))
Cells(counter1, 4) = "JJFox"
Cells(counter1, 5) = Now()
starts = starts + 1
Text = Mid(Text, starts, Len(Text))
Cells(counter1, 7) = gg ' link to the page
counter1 = counter1 + 1
End If
Next i
End If
'Cells(2, 6) = Time()
End Sub
Function GetHTML(url As String) As String
With CreateObject("MSXML2.ServerXMLHTTP.6.0")
.Open "GET", url, False
.send
GetHTML = .responseText
End With
End Function
The prices and labels are pulled dynamically from a script tag who content you can parse as json with a json parser. You need to grab the name from the html however.
With a little knowledge of html and css, it is easy enough to define a css pattern to target the script node of interest with:
.fieldset [type='text/x-magento-init']
That looks for a child script
with type
attribute having attribute value text/x-magento-init
, and a parent with class fieldset
.
I have used a tiny bit less efficient (you won't notice):
For i = 1 To optionsCollection.Count
Simply because I know the collection is small and to allow me to index into two variables with a single loop.
Json library:
I use jsonconverter.bas. Download raw code from here and add to standard module called JsonConverter . Remove the top Attribute line from the copied code.
You then need to go:
VBE > Tools > References > Add references to:
Microsoft Scripting Runtime
Microsoft HTML Object Library
Microsoft XML Library.
In VBA for json the [] denotes a collection and the {} represents a dictionary.
Option Explicit
Public Sub GetCigarData()
'< VBE > Tools > References:
'Microsoft Scripting Runtime
'Microsoft HTML Object Library
'Microsoft XML Library
Dim json As Object, html As MSHTML.HTMLDocument, xhr As MSXML2.XMLHTTP60, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.jjfox.co.uk/cohiba-robusto-621.html", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
Set json = jsonConverter.ParseJson(html.querySelector(".fieldset [type='text/x-magento-init']").innerHTML)("#product_addtocart_form")("configurable")("spConfig")
Dim prices As Scripting.Dictionary, options As Scripting.Dictionary, optionsCollection As Collection
Set prices = json("optionPrices")
Set options = json("attributes")
Set optionsCollection = options(options.Keys(0))("options")
Dim results() As Variant, headers() As Variant, i As Long, name As String
ReDim results(1 To optionsCollection.Count, 1 To 3)
name = html.querySelector(".base").innerText
For i = 1 To optionsCollection.Count
results(i, 1) = name
results(i, 2) = optionsCollection.item(i)("label")
results(i, 3) = prices(prices.Keys(i - 1))("finalPrice")("amount")
Next
headers = Array("Name", "Size", "Price")
With ws
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Read about css selectors:
这篇关于使用组合框VBA中的选项来抓取网站数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!