Parse HTML content in VBA
Problem description
I have a question about HTML parsing. I have a website listing some products, and I would like to pull text from each product page into my current spreadsheet. The spreadsheet is quite big: it contains the ItemNbr in the 3rd column, I want the text to go into the 14th column, and each row corresponds to one product (item).
My idea is to fetch the 'Material' value from the webpage, which sits in the inner text after the tag. The id number sometimes changes from one page to the next.
Here is the structure of the website:
<div style="position:relative;">
<div></div>
<table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
<tbody>
<tr class="jqgfirstrow" role="row" style="height:auto">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
<td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
</tr>
<tr ...>
</tr>
</tbody>
</table> </div>
I would like to get "600D Polyester" as a result.
My (non-working) code snippet is as follows:
Sub ParseMaterial()
Dim Cell As Integer
Dim ItemNbr As String
Dim AElement As Object
Dim AElements As IHTMLElementCollection
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLBody As MSHTML.HTMLBody
Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body
For Cell = 1 To 5 'I iterate through the file row by row
ItemNbr = Cells(Cell, 3).Value 'ItemNbr is in the 3rd column of my spreadsheet
IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
IE.send
While IE.ReadyState <> 4
DoEvents
Wend
HTMLBody.innerHTML = IE.responseText
Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
For Each AElement In AElements
If AElement.Title = "Material" Then
Cells(Cell, 14) = AElement.nextNode.value 'I write the material in the 14th column
End If
Next AElement
Application.Wait (Now + TimeValue("0:00:2"))
Next Cell
Thanks for your help!
Just a couple of things that will hopefully point you in the right direction:
- Clean up a bit: remove the loop that tests the readyState property. Because the request is opened synchronously, the value returned by readyState will never change inside that loop: the code pauses at the send instruction and resumes only once the server response has been received or the request has failed. By then readyState has been set accordingly. You should still test the ready state once, but the loop is unnecessary.
- Target the right HTML elements: you are searching through the tr elements, while the way you use these elements in your code actually points to td elements.
- Make sure the properties are actually available on the objects you are using them with: to help with this, try to declare all your variables as specific object types instead of the generic Object. This will activate IntelliSense. If you have a hard time finding the actual name of the type as defined in the relevant library in the first place, declare the variable as the generic Object, run your code, and then inspect the type of the object, for instance by printing TypeName(your_object) to the debug window. This should put you on your way.
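The TypeName tip from the last point can be tried without touching the network. The following is only a sketch: the markup is a hypothetical stand-in for the real page, the Sub name is made up, and it assumes the "Microsoft HTML Object Library" reference that the rest of this thread already relies on.

```vba
'Sketch only: prints the concrete type names of a few MSHTML objects so that
'the variables can later be declared with those specific types.
Sub InspectObjectTypes()
    Dim doc As MSHTML.HTMLDocument
    Dim found As Object 'generic on purpose: the concrete type is what we want to learn

    Set doc = New MSHTML.HTMLDocument
    'Hypothetical sample markup, just to have something to inspect
    doc.body.innerHTML = "<table id=""list-table""><tr>" & _
                         "<td title=""Material"">Material</td></tr></table>"

    Set found = doc.getElementById("list-table")
    Debug.Print TypeName(found) 'type of the table element

    Set found = found.getElementsByTagName("td")
    Debug.Print TypeName(found) 'type of the collection of cells

    Set found = found.Item(0)
    Debug.Print TypeName(found) 'type of a single cell
End Sub
```

Run it once with the Immediate window open (Ctrl+G), then replace As Object with the type names it prints.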
I have also included some code below that may help. If you still can't get this to work and you can share your URLs, please do.
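Sub getInfoWeb()
    Dim cell As Integer
    Dim ItemNbr As String
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSHTML.HTMLDocument
    Dim table As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim tableCell As Object 'late bound so that NextSibling.innerHTML resolves at run time

    Set xhr = New MSXML2.XMLHTTP60

    For cell = 1 To 5
        ItemNbr = Cells(cell, 3).Value
        Set doc = Nothing 'forget the document of the previous row

        With xhr
            'Synchronous request: .send returns only once the response has arrived
            .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
            .send
            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSHTML.HTMLDocument
                doc.body.innerHTML = .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                       vbNewLine & "HTTP request status: " & .Status
            End If
        End With

        'Only parse when the request succeeded and the table is present
        If Not doc Is Nothing Then
            Set table = doc.getElementById("list-table")
            If Not table Is Nothing Then
                Set tableCells = table.getElementsByTagName("td")
                For Each tableCell In tableCells
                    If tableCell.getAttribute("title") = "Material" Then
                        'The value sits in the cell right after the "Material" label
                        Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
                        Exit For
                    End If
                Next tableCell
            End If
        End If
    Next cell
End Sub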
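To check the td lookup without hitting the server, the parsing step can be pulled into a small function and fed a hand-written string. This is a sketch under assumptions: ValueNextToTitle and DemoValueNextToTitle are hypothetical names, the sample markup is a trimmed copy of the row shown in the question, and the label cell is assumed to be followed directly by the value cell, with no text node in between.

```vba
'Sketch only: returns the innerHTML of the cell that follows the first <td>
'whose title attribute equals cellTitle, or "" when no such cell exists.
Function ValueNextToTitle(ByVal html As String, ByVal cellTitle As String) As String
    Dim doc As MSHTML.HTMLDocument
    Dim tableCell As Object 'late bound, as in the code above
    Dim valueCell As Object

    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = html

    For Each tableCell In doc.getElementsByTagName("td")
        If tableCell.getAttribute("title") = cellTitle Then
            Set valueCell = tableCell.NextSibling
            If Not valueCell Is Nothing Then ValueNextToTitle = valueCell.innerHTML
            Exit Function
        End If
    Next tableCell
End Function

Sub DemoValueNextToTitle()
    Dim sample As String
    'Trimmed copy of the row from the question
    sample = "<table id=""list-table""><tbody><tr>" & _
             "<td title=""Material"">Material</td>" & _
             "<td title=""600D polyester."">600D polyester.</td>" & _
             "</tr></tbody></table>"
    Debug.Print ValueNextToTitle(sample, "Material")
End Sub
```

Keeping the parsing separate from the request makes it easier to tell whether a failure comes from the download or from the element lookup.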
EDIT: as a follow-up to the further information you provided in the comment below, and the additional comments I have added:
'Determine your product number:
'open an xhr for your source url and retrieve the product number from there - search for the tag whose
'text includes the "productnummer:" substring, and extract the product number from that string
'OR
'if the product number consistently consists of the fctkeywords you are entering in your source url
'with two "0" appended - just build the product number like that
'Open a new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response in an XML document, and retrieve the material information
Sub getInfoWeb()
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSXML2.DOMDocument60
    Dim xmlCell As MSXML2.IXMLDOMElement
    Dim xmlCells As MSXML2.IXMLDOMNodeList
    Dim materialValueElement As MSXML2.IXMLDOMElement

    Set xhr = New MSXML2.XMLHTTP60

    With xhr
        .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
        .send
        If .readyState = 4 And .Status = 200 Then
            'The response is XML, so load it into an XML document rather than an HTML one
            Set doc = New MSXML2.DOMDocument60
            doc.LoadXML .responseText
        Else
            MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                   vbNewLine & "HTTP request status: " & .Status
            Exit Sub 'nothing to parse
        End If
    End With

    'The value is in the <cell> element right after the one labelled "Materiaal"
    Set xmlCells = doc.getElementsByTagName("cell")
    For Each xmlCell In xmlCells
        If xmlCell.Text = "Materiaal" Then
            Set materialValueElement = xmlCell.NextSibling
            Exit For
        End If
    Next xmlCell

    If materialValueElement Is Nothing Then
        MsgBox "Material not found"
    Else
        MsgBox materialValueElement.Text
    End If
End Sub
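The second option in the plan above (append two "0" to the fctkeywords value) could be sketched as a small helper. BuildSpecUrl is a hypothetical name, and whether the product number is always fctkeywords followed by "00" remains the assumption stated in the plan, not something verified here.

```vba
'Sketch only: builds the product specification url from the fctkeywords value,
'assuming the product number is that value with two "0" appended.
Function BuildSpecUrl(ByVal fctKeywords As String) As String
    Const BASE_URL As String = "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc="
    Const QUERY_TAIL As String = "&_search=false&rows=-1&page=1&sidx=&sord=asc"
    Dim productNumber As String

    productNumber = fctKeywords & "00"
    BuildSpecUrl = BASE_URL & productNumber & QUERY_TAIL
End Function
```

For example, BuildSpecUrl("100317") yields the url hard-coded in the Sub above, so the .Open call could take BuildSpecUrl(...) as its second argument once the keywords are read from the sheet.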
EDIT2: an alternative that automates IE
Sub searchWebViaIE()
    Dim ie As SHDocVw.InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim anchors As MSHTML.IHTMLElementCollection
    Dim anchor As MSHTML.HTMLAnchorElement
    Dim listTable As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim materialValueElement As MSHTML.HTMLTableCell
    Dim tableCell As MSHTML.HTMLTableCell

    Set ie = New SHDocVw.InternetExplorer

    With ie
        .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
        .Visible = True

        'Wait for the initial page load
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop

        Set doc = .document
        Set anchors = doc.getElementsByTagName("a")

        'Click the "Product Specificatie" link so that the specification table gets loaded
        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                anchor.Click
                Exit For
            End If
        Next anchor

        'Wait again for the content triggered by the click
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop
    End With

    'Look for the cell labelled "Materiaal"; the value is in the next cell
    Set listTable = doc.getElementById("list-table")
    If Not listTable Is Nothing Then
        Set tableCells = listTable.getElementsByTagName("td")
        For Each tableCell In tableCells
            If tableCell.innerHTML = "Materiaal" Then
                Set materialValueElement = tableCell.NextSibling
                Exit For
            End If
        Next tableCell
    End If

    If materialValueElement Is Nothing Then
        MsgBox "Material not found"
    Else
        MsgBox materialValueElement.innerHTML
    End If
End Sub