How do I parse HTML without creating an Internet Explorer object in VBA?


Problem description

None of the computers at work have Internet Explorer, so creating an Internet Explorer object and using ie.navigate to parse the HTML and search for tags isn't possible. My question is: how can I automatically pull certain data, identified by a tag, from a frame's source into my spreadsheet without using IE? Code examples in answers would be very useful :) Thanks

Recommended answer

You could use XMLHTTP to retrieve the HTML source of a web page:

Function GetHTML(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        GetHTML = .ResponseText
    End With
End Function

I wouldn't suggest using this as a worksheet function; otherwise the site will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could be banned, temporarily or permanently, depending on the site.
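One way to avoid repeat requests is to keep each page's source in memory after the first fetch. The following is only a minimal sketch, not part of the original answer: the names GetHTMLCached and htmlCache are hypothetical, it assumes the GetHTML function shown above is in the same project, and the Private declaration must sit at the top of a standard module, before any procedures.

```vba
' Hypothetical module-level cache: URL -> HTML source (illustrative name)
Private htmlCache As Object

Public Function GetHTMLCached(url As String) As String
    ' Create the dictionary on first use (late-bound Scripting.Dictionary)
    If htmlCache Is Nothing Then Set htmlCache = CreateObject("Scripting.Dictionary")

    ' Query the site only if this URL has not been fetched yet
    If Not htmlCache.Exists(url) Then htmlCache.Add url, GetHTML(url)

    GetHTMLCached = htmlCache(url)
End Function
```

The cache lasts only while the VBA project stays loaded and is cleared when the project is reset, so a page that changes often would still need an explicit refresh.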

Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string and search for your tag.

This basic function will return the text between <tag> and </tag>:

Public Function getTag(url As String, ByVal tag As String, Optional ByVal occurNum As Integer = 1) As String
    Dim html As String, pStart As Long, pEnd As Long, o As Integer
    html = GetHTML(url)

    'remove <> if they exist so we can add our own
    '(tag is passed ByVal so the caller's variable is not changed)
    If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
        tag = Mid(tag, 2, Len(tag) - 2)
    End If

    ' default to occurrence #1
    If occurNum < 1 Then occurNum = 1
    pEnd = 1

    For o = 1 To occurNum
        ' find start <tag> beginning at 1 (or after the previous occurrence)
        pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
        If pStart = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len("<" & tag & ">")

        ' find first end </tag> after start <tag>
        pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
        If pEnd = 0 Then
            ' no closing tag: a 0 here would make the next InStr or Mid raise an error
            getTag = "{Not Found}"
            Exit Function
        End If
    Next o

    'return string between start <tag> & end </tag>
    getTag = Mid(html, pStart, pEnd - pStart)
End Function

This will find only basic <tag>s (an opening tag with attributes will not match), but you could add/remove/change the text functions to suit your needs.

Example usage:

Sub findTagExample()

    Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"

    '(the results noted below reflect the page when the answer was written and may differ today)

    'search for 2nd occurrence of tag: <h2> which is "Contents" :
    Debug.Print getTag(testURL, "<h2>", 2)

    '...this returns the 8th occurrence, "Navigation Menu" :
    Debug.Print getTag(testURL, "<h2>", 8)

    '...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
    Debug.Print getTag(testURL, "<h2>", 4)

End Sub
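As one illustration of adapting the text functions, the sketch below also matches opening tags that carry attributes, such as <h2 class="title">. It is not from the original answer: the name GetTagFromHtml is hypothetical, it takes an HTML string that has already been fetched instead of a URL, and it counts opening tags rather than resuming after the previous closing tag. It remains plain string matching, so it assumes no ">" appears inside an attribute value and does not handle nested tags of the same name.

```vba
Public Function GetTagFromHtml(html As String, ByVal tag As String, _
                               Optional ByVal occurNum As Long = 1) As String
    Dim pos As Long, pOpenEnd As Long, pClose As Long, found As Long
    Dim nextChar As String

    GetTagFromHtml = "{Not Found}"
    pos = 1
    Do
        ' find the next "<tag" without requiring an immediate ">"
        pos = InStr(pos, html, "<" & tag, vbTextCompare)
        If pos = 0 Then Exit Function

        ' the next character must end the tag name, so "<h2" does not match "<h20"
        nextChar = Mid$(html, pos + Len(tag) + 1, 1)
        If nextChar = ">" Or nextChar = " " Or nextChar = vbTab _
           Or nextChar = vbCr Or nextChar = vbLf Then
            found = found + 1
            If found = occurNum Then Exit Do
        End If
        pos = pos + 1
    Loop

    ' skip past any attributes to the end of the opening tag
    pOpenEnd = InStr(pos, html, ">")
    If pOpenEnd = 0 Then Exit Function

    ' find the first closing </tag> after the opening tag
    pClose = InStr(pOpenEnd, html, "</" & tag & ">", vbTextCompare)
    If pClose = 0 Then Exit Function

    GetTagFromHtml = Mid$(html, pOpenEnd + 1, pClose - pOpenEnd - 1)
End Function
```

For example, Debug.Print GetTagFromHtml("<h2 class=""title"">Hello</h2>", "h2") prints Hello. Because it works on a string, the page only needs to be downloaded once before several tags are extracted from it.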
