Parse HTML content in VBA
Question
I have a question about HTML parsing. I have a website with some products, and I would like to capture text from each product page into my current spreadsheet. The spreadsheet is quite big; it contains the ItemNbr in the 3rd column, the captured text should go in the 14th column, and each row corresponds to one product (item).
My idea is to fetch the 'Material' value on the webpage, which is the inner text of the table cell that follows the 'Material' label cell. The row id number sometimes changes from one page to another.
Here is the structure of the website:
<div style="position:relative;">
<div></div>
<table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
<tbody>
<tr class="jqgfirstrow" role="row" style="height:auto">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td ...</td>
<td ...</td>
</tr>
<tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
<td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
<td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
</tr>
<tr ...>
</tr>
</tbody>
</table> </div>
I would like to get "600D polyester." as a result.
My (not working) code snippet is as follows:
Sub ParseMaterial()
Dim Cell As Integer
Dim ItemNbr As String
Dim AElement As Object
Dim AElements As IHTMLElementCollection
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLBody As MSHTML.HTMLBody
Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body
For Cell = 1 To 5 'I iterate through the file row by row
ItemNbr = Cells(Cell, 3).Value 'ItemNbr is in the 3rd column of my spreadsheet
IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
IE.send
While IE.ReadyState <> 4
DoEvents
Wend
HTMLBody.innerHTML = IE.responseText
Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
For Each AElement In AElements
If AElement.Title = "Material" Then
Cells(Cell, 14) = AElement.nextNode.value 'I write the material in the 14th column
End If
Next AElement
Application.Wait (Now + TimeValue("0:00:2"))
Next Cell
Thanks for your help!
Solution
Just a couple of things that will hopefully point you in the right direction:
Clean up a bit: remove the loop that tests the readyState property. The request is opened synchronously (the third argument to Open is False), so the value of readyState never changes inside that loop. Execution pauses at the send instruction and resumes only once the server response has been received or the request has failed, and by then readyState is already set. You should still test readyState once, but the loop is unnecessary.
Target the right HTML elements: you are searching through the tr elements, while the way you use these elements in your code (reading the title attribute and moving to the next node) actually points to td elements.
Make sure the properties are actually available on the objects you are using them with: to help with this, try to declare all your variables as specific object types instead of the generic Object. This activates IntelliSense. If you have a hard time finding the actual name of a type as defined in the relevant library, declare the variable as the generic Object, run your code, and then inspect the type of the object, for instance by printing TypeName(your_object) to the debug window. This should put you on your way.
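As a minimal sketch of the third tip, the helper below parses a piece of markup and reports the runtime type name of one element. It assumes a Windows machine where the late-bound "htmlfile" object is available; the names ElementTypeName and DemoElementTypeName are made up for illustration and are not part of any library.

```vba
' Returns the runtime class name of the element with the given id,
' or an empty string when the id is not present in the markup.
' Late binding is used so no library reference has to be set.
Function ElementTypeName(ByVal html As String, ByVal elementId As String) As String
    Dim doc As Object
    Dim node As Object

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    Set node = doc.getElementById(elementId)
    If Not node Is Nothing Then ElementTypeName = TypeName(node)
End Function

Sub DemoElementTypeName()
    ' Prints the type name to the Immediate window; that name is what
    ' you would then look up in the MSHTML library for a typed Dim.
    Debug.Print ElementTypeName( _
        "<table id=""list-table""><tr><td>Material</td></tr></table>", _
        "list-table")
End Sub
```

Once the printed name is known, the variable can be redeclared with that specific type so that IntelliSense lists the members that really exist on it.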
I have also included some code below that may help. If you still can't get this to work and you can share your URLs, please do.
Sub getInfoWeb()
Dim cell As Integer
Dim ItemNbr As String
Dim xhr As MSXML2.XMLHTTP60
Dim doc As MSHTML.HTMLDocument
Dim table As MSHTML.HTMLTable
Dim tableCells As MSHTML.IHTMLElementCollection
Dim tableCell As Object 'kept late-bound so that NextSibling.innerHTML resolves at run time
Set xhr = New MSXML2.XMLHTTP60
For cell = 1 To 5
ItemNbr = Cells(cell, 3).Value
Set doc = Nothing 'reset so a failed request cannot reuse the previous page
With xhr
.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
.send
If .readyState = 4 And .Status = 200 Then
Set doc = New MSHTML.HTMLDocument
doc.body.innerHTML = .responseText
Else
MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
vbNewLine & "HTTP request status: " & .Status
End If
End With
If Not doc Is Nothing Then
Set table = doc.getElementById("list-table")
If Not table Is Nothing Then
Set tableCells = table.getElementsByTagName("td")
For Each tableCell In tableCells
'the label cell carries title="Material"; the value is in the cell right after it
If tableCell.getAttribute("title") = "Material" Then
Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
Exit For
End If
Next tableCell
End If
End If
Next cell
End Sub
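The lookup itself can be tried offline, without any HTTP request. The sketch below is one hedged way to do that: it feeds a markup string to a late-bound "htmlfile" document and returns the text of the cell that follows the label cell. The function name FindValueByTitle and its parameters are hypothetical, and the step that skips non-element siblings is an added precaution, not something the page is known to need.

```vba
' Returns the text of the cell that follows the <td> whose title attribute
' equals labelTitle, or an empty string when no such cell is found.
Function FindValueByTitle(ByVal html As String, ByVal labelTitle As String) As String
    Dim doc As Object
    Dim tds As Object
    Dim td As Object
    Dim sibling As Object
    Dim i As Long

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    Set tds = doc.getElementsByTagName("td")
    For i = 0 To tds.Length - 1
        Set td = tds.Item(i)
        If td.getAttribute("title") = labelTitle Then
            Set sibling = td.NextSibling
            ' Skip anything that is not an element node (nodeType 1),
            ' such as whitespace text between the two cells.
            Do While Not sibling Is Nothing
                If sibling.NodeType = 1 Then Exit Do
                Set sibling = sibling.NextSibling
            Loop
            If Not sibling Is Nothing Then
                FindValueByTitle = Trim$(sibling.innerText)
            End If
            Exit Function
        End If
    Next i
End Function

Sub DemoFindValueByTitle()
    ' Markup reduced from the table row shown in the question.
    Debug.Print FindValueByTitle( _
        "<table id=""list-table""><tr>" & _
        "<td title=""Material"">Material</td>" & _
        "<td title=""600D polyester."">600D polyester.</td>" & _
        "</tr></table>", "Material")
End Sub
```

Keeping the parsing in a function that takes a string makes it easy to check against a saved copy of the page before wiring it to the request loop.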
EDIT: as a follow-up to the further information you provided in the comment below, and to the additional comments I have added:
'Determine your product number:
'open an xhr for your source url and retrieve the product number from there: search for the tag whose
'text includes the "productnummer:" substring, and extract the product number from that text
'OR
'if the product number consistently consists of the fctkeywords you are entering in your source url
'with two "0" characters appended, just build the product number like that.
'Open a new xhr for this url: "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response into an XML document and retrieve the material information
Sub getInfoWeb()
Dim xhr As MSXML2.XMLHTTP60
Dim doc As MSXML2.DOMDocument60
Dim xmlCell As MSXML2.IXMLDOMElement
Dim xmlCells As MSXML2.IXMLDOMNodeList
Dim materialValueElement As MSXML2.IXMLDOMElement
Set xhr = New MSXML2.XMLHTTP60
With xhr
.Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
.send
If .readyState = 4 And .Status = 200 Then
Set doc = New MSXML2.DOMDocument60
doc.LoadXML .responseText
Else
MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
vbNewLine & "HTTP request status: " & .Status
End If
End With
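If doc Is Nothing Then Exit Sub 'the request failed; the error was already reported above
Set xmlCells = doc.getElementsByTagName("cell")
For Each xmlCell In xmlCells
If xmlCell.Text = "Materiaal" Then
Set materialValueElement = xmlCell.NextSibling
Exit For
End If
Next xmlCell
If materialValueElement Is Nothing Then
MsgBox "No ""Materiaal"" cell was found in the response"
Else
MsgBox materialValueElement.Text
End If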
End Sub
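The two steps described in the comments at the top of this routine (build the product number, then read the material from the XML response) can be sketched as two small functions. This is only an illustration under stated assumptions: the names BuildSpecUrl and MaterialFromXml are hypothetical, the "append two zeros" rule is the unconfirmed guess from the comments, and the rows/row/cell layout in the demo is an assumed response shape, not a captured one.

```vba
' Builds the specification URL from the fctkeywords value, assuming the
' product number is the keyword with "00" appended.
Function BuildSpecUrl(ByVal fctKeywords As String) As String
    Const BASE_URL As String = _
        "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p"
    BuildSpecUrl = BASE_URL & "?itemc=" & fctKeywords & "00" & _
        "&_search=false&rows=-1&page=1&sidx=&sord=asc"
End Function

' Returns the text of the <cell> that follows the <cell> whose text equals
' label, or an empty string if the XML is malformed or nothing matches.
Function MaterialFromXml(ByVal xmlText As String, _
                         Optional ByVal label As String = "Materiaal") As String
    Dim doc As Object
    Dim node As Object

    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    If Not doc.LoadXML(xmlText) Then Exit Function

    ' XPath: first <cell> sibling after the cell holding the label text.
    Set node = doc.SelectSingleNode( _
        "//cell[.='" & label & "']/following-sibling::cell[1]")
    If Not node Is Nothing Then MaterialFromXml = node.Text
End Function

Sub DemoMaterialFromXml()
    Debug.Print BuildSpecUrl("100317")
    ' Assumed response shape, for illustration only.
    Debug.Print MaterialFromXml( _
        "<rows><row id=""7""><cell>Materiaal</cell>" & _
        "<cell>600D polyester.</cell></row></rows>")
End Sub
```

Using an XPath query instead of a loop means a missing label simply yields Nothing, which the function turns into an empty string rather than a runtime error.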
EDIT2: an alternative that automates Internet Explorer
Sub searchWebViaIE()
Dim ie As SHDocVw.InternetExplorer
Dim doc As MSHTML.HTMLDocument
Dim anchors As MSHTML.IHTMLElementCollection
Dim anchor As MSHTML.HTMLAnchorElement
Dim prodSpec As MSHTML.HTMLAnchorElement
Dim tableCells As MSHTML.IHTMLElementCollection
Dim materialValueElement As MSHTML.HTMLTableCell
Dim tableCell As MSHTML.HTMLTableCell
Set ie = New SHDocVw.InternetExplorer
With ie
.navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
.Visible = True
Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
DoEvents
Loop
Set doc = .document
Set anchors = doc.getElementsByTagName("a")
For Each anchor In anchors
If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
anchor.Click
Exit For
End If
Next anchor
Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
DoEvents
Loop
End With
For Each anchor In anchors
If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
Set prodSpec = anchor
End If
Next anchor
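If doc.getElementById("list-table") Is Nothing Then
MsgBox "The ""list-table"" element was not found on the page"
Exit Sub
End If
Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")
For Each tableCell In tableCells
If tableCell.innerHTML = "Materiaal" Then
Set materialValueElement = tableCell.NextSibling
Exit For
End If
Next tableCell
If materialValueElement Is Nothing Then
MsgBox "No ""Materiaal"" cell was found in the table"
Else
MsgBox materialValueElement.innerHTML
End If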
End Sub