Parse HTML content in VBA


Question

I have a question about HTML parsing. I have a website with some products, and I would like to capture text from each product page into my current spreadsheet. The spreadsheet is quite big; it contains the ItemNbr in the 3rd column, I want the text in the 14th column, and each row corresponds to one product (item).

My idea is to fetch the 'Material' value on the webpage, which is the inner text of the cell that follows the 'Material' label cell. The row id changes from one page to another (sometimes).

Here is the structure of the website:

<div style="position:relative;">
    <div></div>
    <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
        <tbody>
            <tr class="jqgfirstrow" role="row" style="height:auto">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
                <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
            </tr>           
            <tr ...>
            </tr>
        </tbody>
    </table> </div>

I would like to get "600D Polyester" as a result.

Here is my (non-working) code snippet as it stands:

Sub ParseMaterial()

    Dim Cell As Integer
    Dim ItemNbr As String

    Dim AElement As Object
    Dim AElements As IHTMLElementCollection
    Dim IE As MSXML2.XMLHTTP60
    Set IE = New MSXML2.XMLHTTP60

    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLBody As MSHTML.HTMLBody

    Set HTMLDoc = New MSHTML.HTMLDocument
    Set HTMLBody = HTMLDoc.body

    For Cell = 1 To 5                            'I iterate through the file row by row

        ItemNbr = Cells(Cell, 3).Value           'ItemNbr is in the 3rd column of my spreadsheet

        IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
        IE.send

        While IE.ReadyState <> 4
            DoEvents
        Wend

        HTMLBody.innerHTML = IE.responseText

        Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
        For Each AElement In AElements
            If AElement.Title = "Material" Then
                Cells(Cell, 14) = AElement.nextNode.value     'I write the material in the 14th column
            End If
        Next AElement

        Application.Wait (Now + TimeValue("0:00:2"))

    Next Cell

End Sub

Thanks for your help!

Solution

Just a couple of things that will hopefully point you in the right direction:

  • clean up a bit: remove the readyState polling loop. Because the request is opened synchronously (the third argument to Open is False), execution pauses at send and resumes only once the server response has been received or the request has failed. readyState is already set by then, so its value never changes inside the loop. You should still check readyState once, but the loop is unnecessary

  • target the right HTML elements: you are searching through the tr elements, while the way you use those elements in your code actually points to td elements (in your markup the title attribute is on the td, not on the tr)

  • make sure the properties are actually available on the objects you use them on: to help with this, declare all your variables as specific object types instead of the generic Object. This activates IntelliSense. If you have a hard time finding the actual name of an object's type as defined in the relevant library, declare it as the generic Object first, run your code, and then inspect the object's type, for instance by printing TypeName(your_object) to the debug (Immediate) window. This should put you on your way
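The TypeName tip in the last point can be tried without touching the website. The following is a minimal sketch, assuming a late-bound "htmlfile" document and a made-up two-cell table; the markup and the name InspectElementTypes are illustrative and not taken from the real page. It prints the runtime type of whatever getElementById and getElementsByTagName return, which is the name to look up when replacing Object with a specific type.

```vba
'Illustrative only: the markup below is a made-up stand-in for the real page.
Sub InspectElementTypes()
    Dim doc As Object          'late-bound MSHTML document
    Dim table As Object
    Dim tdList As Object

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = "<table id=""list-table""><tr>" & _
                         "<td title=""Material"">Material</td>" & _
                         "<td title=""600D polyester."">600D polyester.</td>" & _
                         "</tr></table>"

    Set table = doc.getElementById("list-table")
    Set tdList = table.getElementsByTagName("td")

    'Print the runtime type names to the Immediate window (Ctrl+G)
    Debug.Print TypeName(table)
    Debug.Print TypeName(tdList)
    Debug.Print TypeName(tdList.Item(0))
End Sub
```

Once the names are known, the Object declarations can be swapped for the matching types from the Microsoft HTML Object Library, provided that reference is enabled in the project.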

I have also included some code below that may help. If you still can't get this to work and you can share your URLs, please do.

Sub getInfoWeb()

    Dim cell As Integer
    Dim ItemNbr As String
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSHTML.HTMLDocument
    Dim table As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim tableCell As Object

    Set xhr = New MSXML2.XMLHTTP60

    For cell = 1 To 5

        ItemNbr = Cells(cell, 3).Value
        Set doc = Nothing    'so a failed request does not reuse the previous page

        With xhr

            'Synchronous request (third argument False): .send blocks until the response arrives
            .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
            .send

            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSHTML.HTMLDocument
                doc.body.innerHTML = .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
            End If

        End With

        'Only parse when the request succeeded and the table is present
        If Not doc Is Nothing Then
            Set table = doc.getElementById("list-table")

            If Not table Is Nothing Then
                Set tableCells = table.getElementsByTagName("td")

                For Each tableCell In tableCells
                    If tableCell.getAttribute("title") = "Material" Then
                        'The value sits in the cell right after the "Material" label cell
                        Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
                        Exit For
                    End If
                Next tableCell
            End If
        End If

    Next cell

End Sub
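The label-then-next-cell lookup can also be pulled into a small function that works on an HTML string, which makes it easy to try offline. This is a sketch under assumptions: the function name ValueAfterLabel and its parameters are made up for illustration, the document is created late-bound via "htmlfile", and the lookup walks forward to the next element node rather than trusting NextSibling directly, in case a text node sits between the two cells.

```vba
'Returns the inner text of the cell that follows the <td> whose title
'attribute equals label, or "" when the table or the label is missing.
'The function name and parameters are illustrative, not from the original answer.
Function ValueAfterLabel(html As String, tableId As String, label As String) As String
    Dim doc As Object, table As Object, cell As Object, node As Object

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    Set table = doc.getElementById(tableId)
    If table Is Nothing Then Exit Function

    For Each cell In table.getElementsByTagName("td")
        '& "" turns a Null attribute value into an empty string before comparing
        If (cell.getAttribute("title") & "") = label Then
            'Walk forward to the next element node (nodeType 1), skipping text nodes
            Set node = cell.nextSibling
            Do While Not node Is Nothing
                If node.nodeType = 1 Then
                    ValueAfterLabel = Trim$(node.innerText)
                    Exit Function
                End If
                Set node = node.nextSibling
            Loop
            Exit Function
        End If
    Next cell
End Function
```

Inside the row loop this could be used as Cells(cell, 14).Value = ValueAfterLabel(xhr.responseText, "list-table", "Material"), assuming the response really contains the table in its static HTML.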

EDIT: as a follow-up to the further information you provided in the comment below, with some additional comments of my own:

'Determine your product number
    'Open an xhr for your source url, and retrieve the product number from there - search for the tag whose
    'text includes the "productnummer:" substring, and extract the product number from the surrounding text
    'OR
    'if the product number consistently consists of the fctkeywords you are entering in your source url
    'with two "0" appended - just build the product number like that
'Open a new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response in an XML document, and retrieve the material information
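The second option in the comments above (keyword plus two zeros) can be sketched as a small helper. This assumes the product number really is the fctkeywords value with "00" appended, which the comments present as only one of two possibilities; the function name BuildSpecUrl is made up for illustration.

```vba
'Builds the product-specification URL from the search keyword.
'Assumption: product number = fctkeywords with "00" appended.
Function BuildSpecUrl(ByVal fctKeywords As String) As String
    Const BASE_URL As String = _
        "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p"
    Dim productNumber As String

    productNumber = fctKeywords & "00"
    BuildSpecUrl = BASE_URL & "?itemc=" & productNumber & _
                   "&_search=false&rows=-1&page=1&sidx=&sord=asc"
End Function
```

For the keyword 100317 this yields the same URL with itemc=10031700 that the hard-coded request uses.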

Sub getInfoWeb()

    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSXML2.DOMDocument60
    Dim xmlCell As MSXML2.IXMLDOMElement
    Dim xmlCells As MSXML2.IXMLDOMNodeList
    Dim materialValueElement As MSXML2.IXMLDOMElement

    Set xhr = New MSXML2.XMLHTTP60

    With xhr

        .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
        .send

        If .readyState = 4 And .Status = 200 Then
            Set doc = New MSXML2.DOMDocument60
            doc.LoadXML .responseText
        Else
            MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
            vbNewLine & "HTTP request status: " & .Status
            Exit Sub    'nothing to parse
        End If

    End With

    Set xmlCells = doc.getElementsByTagName("cell")

    'The value is in the <cell> right after the one whose text is "Materiaal"
    For Each xmlCell In xmlCells
        If xmlCell.Text = "Materiaal" Then
            Set materialValueElement = xmlCell.NextSibling
            Exit For
        End If
    Next xmlCell

    If materialValueElement Is Nothing Then
        MsgBox "Material not found"
    Else
        MsgBox materialValueElement.Text
    End If

End Sub
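The same lookup can be written as a single XPath query and tried on an XML string offline. This is a sketch, not the answer's method: it swaps the For Each loop for selectSingleNode, and the rows/row/cell layout of the sample in the usage note is an assumption (only the cell tag name comes from the code above). The function name CellAfterLabel is illustrative, and the label must not contain an apostrophe because it is placed inside a quoted XPath literal.

```vba
'Returns the text of the <cell> that follows the <cell> whose text equals label,
'or "" if the XML does not parse or the label is absent.
'Late-bound MSXML 6.0; the function name is illustrative.
Function CellAfterLabel(xml As String, label As String) As String
    Dim doc As Object, node As Object

    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    doc.setProperty "SelectionLanguage", "XPath"

    'loadXML returns False when the string is not well-formed XML
    If Not doc.loadXML(xml) Then Exit Function

    Set node = doc.selectSingleNode( _
        "//cell[.='" & label & "']/following-sibling::cell[1]")
    If Not node Is Nothing Then CellAfterLabel = node.Text
End Function
```

A call such as CellAfterLabel("<rows><row id=""7""><cell>Materiaal</cell><cell>600D polyester.</cell></row></rows>", "Materiaal") exercises it without a network request; with a live response, xhr.responseText would be passed instead.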

EDIT2: an alternative that automates IE

Sub searchWebViaIE()
    Dim ie As SHDocVw.InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim anchors As MSHTML.IHTMLElementCollection
    Dim anchor As MSHTML.HTMLAnchorElement
    Dim table As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim materialValueElement As MSHTML.HTMLTableCell
    Dim tableCell As MSHTML.HTMLTableCell

    Set ie = New SHDocVw.InternetExplorer

    With ie
        .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
        .Visible = True

        'Wait for the page to finish loading
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop

        Set doc = .document

        Set anchors = doc.getElementsByTagName("a")

        'Click the "Product Specificatie" link to load the specification table
        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                anchor.Click
                Exit For
            End If
        Next anchor

        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop

    End With

    Set table = doc.getElementById("list-table")

    If Not table Is Nothing Then
        Set tableCells = table.getElementsByTagName("td")

        'The value is in the cell right after the "Materiaal" label cell
        For Each tableCell In tableCells
            If tableCell.innerHTML = "Materiaal" Then
                Set materialValueElement = tableCell.NextSibling
                Exit For
            End If
        Next tableCell
    End If

    If materialValueElement Is Nothing Then
        MsgBox "Material not found"
    Else
        MsgBox materialValueElement.innerHTML
    End If

End Sub
