Parse HTML content in VBA


Question



I have a question about HTML parsing. I have a website with some products, and I would like to pull text from each product page into my current spreadsheet. The spreadsheet is quite big; it holds the ItemNbr in the 3rd column, I want the text in the 14th column, and each row corresponds to one product (item).

My idea is to fetch the 'Material' value from the web page, which is in the inner text of the cell that follows the Material label. The id number changes from page to page (sometimes).

Here is the structure of the website:

<div style="position:relative;">
    <div></div>
    <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
        <tbody>
            <tr class="jqgfirstrow" role="row" style="height:auto">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
                <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
            </tr>           
            <tr ...>
            </tr>
        </tbody>
    </table> </div>

I would like to get "600D Polyester" as a result.

My (non-working) code snippet is as follows:

Sub ParseMaterial()

    Dim Cell As Integer
    Dim ItemNbr As String

    Dim AElement As Object
    Dim AElements As IHTMLElementCollection
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60

Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLBody As MSHTML.HTMLBody

Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body

For Cell = 1 To 5                            'I iterate through the file row by row

    ItemNbr = Cells(Cell, 3).Value           'ItemNbr isin the 3rd Column of my spreadsheet

    IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
    IE.send

    While IE.ReadyState <> 4
        DoEvents
    Wend

    HTMLBody.innerHTML = IE.responseText

    Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
    For Each AElement In AElements
        If AElement.Title = "Material" Then
            Cells(Cell, 14) = AElement.nextNode.value     'I write the material in the 14th column
        End If
    Next AElement

        Application.Wait (Now + TimeValue("0:00:2"))

Next Cell

Thanks for your help!

Solution

Just a couple of things that will hopefully point you in the right direction:

  • Clean up a bit: remove the readyState testing loop. Because the request is opened synchronously (the third argument to Open is False), the code pauses at send and resumes only once the server response has been received or the request has failed, so the value of readyState never changes while the loop runs. You should still test readyState once after send, but the loop is unnecessary.

  • Target the right HTML elements: you are searching through the tr elements, while the way you use these elements in your code actually points to td elements.

  • Make sure the properties are actually available on the objects you use them with: to help with this, declare all your variables as specific object types instead of the generic Object. This activates IntelliSense. If you have a hard time finding the actual name of an object's type as defined in the relevant library, declare it as the generic Object, run your code, and then inspect the object's type, for instance by printing TypeName(your_object) to the debug (Immediate) window. This should put you on your way.
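The TypeName tip from the last point can be tried without fetching any page. This is only a minimal sketch: the Sub name is made up for illustration, and it assumes a late-bound "htmlfile" document filled with a cut-down copy of the table from the question.

```vba
Sub InspectElementTypes()
    'Sketch only: a late-bound "htmlfile" document stands in for the fetched page.
    Dim doc As Object
    Dim el As Object

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = "<table id=""list-table""><tr>" & _
                         "<td title=""Material"">Material</td>" & _
                         "<td title=""600D polyester."">600D polyester.</td>" & _
                         "</tr></table>"

    'Print the run-time type name of the table element
    Set el = doc.getElementById("list-table")
    Debug.Print TypeName(el)

    'Print the run-time type name of the collection of cells
    Set el = doc.getElementsByTagName("td")
    Debug.Print TypeName(el)
End Sub
```

The names printed in the Immediate window can then be looked up in the Object Browser to choose a specific type for each Dim statement.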

I have also included some code below that may help. If you still can't get this to work and you can share your URLs, please do.

Sub getInfoWeb()

    Dim cell As Integer
    Dim ItemNbr As String
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSHTML.HTMLDocument
    Dim table As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim tableCell As Object    'late bound so that NextSibling.innerHTML resolves at run time

    Set xhr = New MSXML2.XMLHTTP60

    For cell = 1 To 5

        ItemNbr = Cells(cell, 3).Value
        Set doc = Nothing    'reset so a failed request does not reuse the previous page

        With xhr

            'Synchronous request (third argument False): no readyState loop needed
            .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
            .send

            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSHTML.HTMLDocument
                doc.body.innerHTML = .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
            End If

        End With

        'Only parse when the request succeeded
        If Not doc Is Nothing Then

            Set table = doc.getElementById("list-table")
            Set tableCells = table.getElementsByTagName("td")

            'The value sits in the cell right after the one titled "Material"
            For Each tableCell In tableCells
                If tableCell.getAttribute("title") = "Material" Then
                    Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
                End If
            Next tableCell

        End If

    Next cell

End Sub
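The loop above relies on the node right after the label cell being the value cell. If the parser happens to keep the whitespace between cells as text nodes, NextSibling would return a text node instead. A hedged, more defensive sketch is shown below; NextElementSibling is a hypothetical helper name, not part of MSHTML as used here.

```vba
'Hypothetical helper: returns the first element node that follows startNode,
'or Nothing when there is no such node.
Function NextElementSibling(ByVal startNode As Object) As Object
    Dim node As Object

    Set node = startNode.nextSibling
    Do While Not node Is Nothing
        If node.nodeType = 1 Then Exit Do    '1 = element node
        Set node = node.nextSibling
    Loop

    Set NextElementSibling = node
End Function
```

Inside the For Each loop it could be used as Set valueCell = NextElementSibling(tableCell), followed by a check that valueCell is not Nothing before writing valueCell.innerHTML to the sheet (valueCell being an assumed Object variable).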

EDIT: a follow-up to the further information you provided in the comment below, along with the additional comments I have added

'Determine your product number
    'Open an xhr for your source url, and retrieve the product number from there - search for the tag whose
    'text includes the "productnummer:" substring, and extract the product number from the outer string
    'OR
    'if the product number consistently consists of the fctkeywords you are entering in your source url
    'with two "0" appended - just build the product number like that
'Open a new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response in an XML document, and retrieve the material information

Sub getInfoWeb()

    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSXML2.DOMDocument60
    Dim xmlCell As MSXML2.IXMLDOMElement
    Dim xmlCells As MSXML2.IXMLDOMNodeList
    Dim materialValueElement As MSXML2.IXMLDOMElement

    Set xhr = New MSXML2.XMLHTTP60

    With xhr

        'Synchronous request: .send returns once the response has arrived
        .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
        .send

        If .readyState = 4 And .Status = 200 Then
            Set doc = New MSXML2.DOMDocument60
            doc.LoadXML .responseText
        Else
            MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
            vbNewLine & "HTTP request status: " & .Status
            Exit Sub    'nothing to parse
        End If

    End With

    Set xmlCells = doc.getElementsByTagName("cell")

    'The value is in the cell right after the one whose text is "Materiaal"
    For Each xmlCell In xmlCells
        If xmlCell.Text = "Materiaal" Then
            Set materialValueElement = xmlCell.NextSibling
        End If
    Next xmlCell

    If materialValueElement Is Nothing Then
        MsgBox "Material not found"
    Else
        MsgBox materialValueElement.Text
    End If

End Sub
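The second option described in the comments before this Sub (build the product number from the fctkeywords value with two "0" appended) could be sketched as follows. BuildSpecUrl is a hypothetical helper name, and the "append 00" rule is only the assumption stated in those comments; it has not been confirmed for every product.

```vba
'Hypothetical helper: builds the product-specification URL from the search keyword,
'assuming the product number is the keyword followed by "00".
Function BuildSpecUrl(ByVal keyword As String) As String
    Const BASE_URL As String = "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc="
    Const QUERY_TAIL As String = "&_search=false&rows=-1&page=1&sidx=&sord=asc"

    BuildSpecUrl = BASE_URL & keyword & "00" & QUERY_TAIL
End Function
```

For the keyword "100317", the helper produces the same URL that is hard-coded in the Sub above, so the .Open call could take BuildSpecUrl(ItemNbr) when looping over the spreadsheet rows.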

EDIT2: an alternative that automates IE

Sub searchWebViaIE()
    Dim ie As SHDocVw.InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim anchors As MSHTML.IHTMLElementCollection
    Dim anchor As MSHTML.HTMLAnchorElement
    Dim prodSpec As MSHTML.HTMLAnchorElement
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim materialValueElement As MSHTML.HTMLTableCell
    Dim tableCell As MSHTML.HTMLTableCell

    Set ie = New SHDocVw.InternetExplorer

    With ie
        .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
        .Visible = True

        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop

        Set doc = .document

        Set anchors = doc.getElementsByTagName("a")

        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                anchor.Click
                Exit For
            End If
        Next anchor

        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop

    End With

    For Each anchor In anchors
        If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
            Set prodSpec = anchor
        End If
    Next anchor

    Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")

    If Not tableCells Is Nothing Then
        For Each tableCell In tableCells
            If tableCell.innerHTML = "Materiaal" Then
                Set materialValueElement = tableCell.NextSibling
            End If
        Next tableCell
    End If

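    'Guard against the label not being found on the page
    If materialValueElement Is Nothing Then
        MsgBox "Material not found"
    Else
        MsgBox materialValueElement.innerHTML
    End If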

End Sub
