How do I parse HTML without creating an Internet Explorer object in VBA?
Problem description
None of the computers at work have Internet Explorer installed, so creating an Internet Explorer object and using ie.navigate to load the HTML and search for tags isn't possible. My question is: how can I automatically pull specific tagged data from a frame's source into my spreadsheet without using IE? Code examples in answers would be very useful :) Thanks
Recommended answer
You could use XMLHTTP to retrieve the HTML source of a web page:
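' Returns the HTML source of the page at url using a synchronous GET request.
Function GetHTML(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False   ' False = wait for the response before continuing
        .Send
        GetHTML = .ResponseText
    End With
End Function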
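GetHTML returns whatever the server sends back, including the body of an error page. As a hedged sketch (the name GetHTMLChecked and the error number are illustrative choices, not part of the original answer), a variant could check the HTTP status before handing back the text:

```vba
' Hypothetical variant of GetHTML: raises an error unless the server answers 200 OK.
Function GetHTMLChecked(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False          ' synchronous request
        .Send
        If .Status = 200 Then
            GetHTMLChecked = .ResponseText
        Else
            ' vbObjectError + 513 is an arbitrary user-defined error number
            Err.Raise vbObjectError + 513, "GetHTMLChecked", _
                      "HTTP " & .Status & " " & .StatusText & " for " & url
        End If
    End With
End Function
```

Whether to raise an error or return an empty string on failure is a design choice; raising keeps a failed download from being silently parsed as if it were the page.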
I wouldn't suggest using this as a worksheet function, or else the site URL will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could become banned, temporarily or permanently, depending on the site.
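One way to limit repeat requests is to cache each page's source for the duration of the VBA session. The following is a minimal sketch, assuming the GetHTML function above and the Windows Scripting.Dictionary object; the name GetHTMLCached is hypothetical:

```vba
' Hypothetical helper: fetches each URL at most once per VBA session.
' The Static dictionary keeps its contents between calls until the project is reset.
Function GetHTMLCached(url As String) As String
    Static cache As Object
    If cache Is Nothing Then Set cache = CreateObject("Scripting.Dictionary")
    ' Only go to the network when this URL has not been seen yet
    If Not cache.Exists(url) Then cache.Add url, GetHTML(url)
    GetHTMLCached = cache.Item(url)
End Function
```

The cache never expires on its own, so a page that changes during the session will not be re-read until the VBA project is reset.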
Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string to search for your tag.
This basic function will return the value between the <tag> and </tag>:
Public Function getTag(url As String, ByVal tag As String, Optional occurNum As Integer) As String
    Dim html As String, pStart As Long, pEnd As Long, o As Integer
    html = GetHTML(url)
    ' remove <> if they exist so we can add our own
    ' (tag is passed ByVal so the caller's variable is not modified)
    If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
        tag = Mid(tag, 2, Len(tag) - 2)
    End If
    ' default to occurrence #1
    If occurNum < 1 Then occurNum = 1
    pEnd = 1
    For o = 1 To occurNum
        ' find start <tag> beginning at 1 (or after the previous occurrence)
        pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
        If pStart = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len("<" & tag & ">")
        ' find first end </tag> after start <tag>
        pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
        ' without a closing tag, the next InStr or the final Mid would raise an error
        If pEnd = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
    Next o
    ' return string between start <tag> & end </tag>
    getTag = Mid(html, pStart, pEnd - pStart)
End Function
This will find only basic <tag>s (an opening tag with no attributes), but you could add, remove, or change the text functions to suit your needs.
Example usage:
Sub findTagExample()
    Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"
    ' Results depend on the page's markup; the values below are those reported when this was written.
    ' search for the 2nd occurrence of tag <h2>, which is "Contents":
    Debug.Print getTag(testURL, "<h2>", 2)
    ' ...this returns the 8th occurrence, "Navigation Menu":
    Debug.Print getTag(testURL, "<h2>", 8)
    ' ...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
    Debug.Print getTag(testURL, "<h2>", 4)
End Sub
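Because getTag searches for the exact text "<tag>", it misses opening tags that carry attributes, such as <h2 class="...">, and it re-downloads the page on every call. A different technique, shown here only as a sketch, is to load the already-downloaded string into a late-bound MSHTML document ("htmlfile") and query it by tag name. This assumes the MSHTML component is present on the machine even though the Internet Explorer browser is not; the function and sub names are hypothetical:

```vba
' Hypothetical helper: returns the text of the Nth <tagName> element in an HTML string.
' Assumes the Windows MSHTML component ("htmlfile") is available; no IE window is created.
Function GetTagTextDOM(html As String, tagName As String, Optional occurNum As Long = 1) As String
    Dim doc As Object, nodes As Object
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html                      ' parse the string into a DOM
    Set nodes = doc.getElementsByTagName(tagName)  ' tag name without angle brackets, e.g. "h2"
    If occurNum < 1 Or occurNum > nodes.Length Then
        GetTagTextDOM = "{Not Found}"
    Else
        GetTagTextDOM = nodes.Item(occurNum - 1).innerText   ' the collection is zero-based
    End If
End Function

Sub findTagDomExample()
    Dim html As String
    html = GetHTML("https://en.wikipedia.org/wiki/Web_scraping")   ' download once
    Debug.Print GetTagTextDOM(html, "h2", 1)
    Debug.Print GetTagTextDOM(html, "h2", 2)
End Sub
```

Passing the HTML string in, rather than the URL, lets several lookups share a single download.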