vba, getElementsByClassName, HTMLSource's double quotation marks are gone

Problem description

I scrape some websites for fun, using VBA as the tool. I use XMLHTTP and HTMLDocument because that is faster than InternetExplorer.Application.

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    Set XMLHTTPReq = New MSXML2.XMLHTTP

    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With

    i = 0

    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

    For Each vr In varTemp
        ' *1: the next line is important for diagnosing this issue
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        ' the next line raises error 438
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        Cells(i + 1, 4) = varTemp2.innerText

        i = i + 1

    Next vr
End Sub

I investigated the problem using the line marked *1. Cells(1, 1) shows the following:

<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............

As you can see, the class attributes have lost their double quotation marks (" "); only the class of the first element still has them. I really don't know why this happens.

I could parse by getElementsByTagName("span") instead, but I would prefer to select by class.

Solution

The getElementsByClassName method is not treated as a method of an individual element; it is only available on the parent HTMLDocument. If you want to use it to locate elements within a DIV element, you need to create a sub-HTMLDocument built from the .outerHTML of that specific DIV element.

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    'early binding: needs references to a Microsoft XML library
    'and to the Microsoft HTML Object Library
    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long
    Dim postURL As String, nr As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            'next free row on Sheet1, judged by column C
            nr = Sheet1.Cells(Sheet1.Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Sheet1.Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Sheet1.Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML  'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Sheet1.Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With

End Sub

While the other .getElementsByXXXX methods can readily retrieve collections within another element, getElementsByClassName needs to operate on what it believes to be a whole HTMLDocument, even if you have fooled it into thinking that.
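
If the sub-document trick is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the function name FirstInnerTextByClass and its parameters are hypothetical and do not appear in the original answer, and it assumes a reference to the Microsoft HTML Object Library is set.

```vba
' Hypothetical helper, not part of the original answer.
' Assumes a reference to "Microsoft HTML Object Library".
' Returns the innerText of the first element with the given class
' found in parentEl's markup, or an empty string if there is none.
Public Function FirstInnerTextByClass(ByVal parentEl As Object, _
                                      ByVal className As String) As String
    Dim subDoc As New HTMLDocument
    Dim matches As Object

    ' Load only the parent's own markup into a fresh document
    subDoc.body.innerHTML = parentEl.outerHTML

    ' getElementsByClassName is called on the document, not on the element
    Set matches = subDoc.getElementsByClassName(className)

    If matches.Length > 0 Then
        FirstInnerTextByClass = matches.Item(0).innerText
    End If
End Function
```

Inside the loop it might be called as Sheet1.Cells(nr, 5) = FirstInnerTextByClass(htmlDOC.getElementsByClassName("post_glass post_micro_glass")(iDIV), "hover_inner"). Because the sub-document is built from .outerHTML, the parent element itself is part of it and will be matched too if it carries the requested class.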
