VBA parse dom to find one particular href value
Problem description
I have the French versions of about 400 URLs in an .xls file and, using Excel VBA, I need to find the corresponding URLs for the English version of the same site.
Knowing the DOM structure of the site, I know that I can:
- Open the webpage (MSXML2.XMLHTTP)
- Search the header of the webpage for one particular link: the one that lets the user switch language. The href of that link is the English URL, and that is all I need.
- After that, I can copy the result into the corresponding cell of my worksheet.
The structure looks like this. The "English" link changes, but it always sits within this structure:
<ul class="global-links">
<li><a title="Nous joindre" href="/fr/coordonnees.html">Nous Joindre</a></li>
<li>|</li>
<li><a title="Carrières" href="/fr/carrieres.html">Carrières</a></li>
<li>|</li>
<li><a title="English" href="/en/personal.html">English</a></li>
</ul>
The href I want is the one on the link whose title is "English".
Where I'm stuck: I know of two ways to find the relevant text:
- getElement... (directly parsing the DOM)
- inStr (string manipulation)
I tried both of them, but:
Manipulating the DOM: I would have thought that the following would work, but it doesn't, and it gives me a 438 error. I also don't understand whether it is possible to target a link that has "English" as its title (since it has no special class or id):
.getElementsByClassName("global-links").innertext
So I switched to the InStr method (working out the positions and then doing a Mid). I search for:
InStr(1, htm.body.innerHTML, "title=""English"" href=")
I can't construct the search string properly because of the double quotes. I tried doubling the double quotes, and I also tried the Chr(34) method, like so:
"title=" & Chr(34) & "English" & Chr(34) & " href=" & Chr(34)
But I can't make that work either; it doesn't find my string.
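A minimal sketch of how the string search could be made to work, assuming the raw page source contains the attributes in the order shown above. `ExtractHrefByTitle` is a hypothetical helper name, not something from the original post. It is meant to be run against `.responseText` rather than `htm.body.innerHTML`, on the assumption that MSHTML may re-serialize the markup (attribute order, quoting) so that `innerHTML` no longer matches the text the server sent.

```vba
' Hypothetical helper: returns the href that follows title="<linkTitle>"
' in raw HTML text, or "" when the pattern is not found.
Function ExtractHrefByTitle(ByVal html As String, ByVal linkTitle As String) As String
    Dim marker As String
    Dim startPos As Long
    Dim endPos As Long

    ' Inside a VBA string literal, two double quotes produce one literal quote.
    ' The result is:  title="<linkTitle>" href="
    marker = "title=""" & linkTitle & """ href="""

    startPos = InStr(1, html, marker, vbTextCompare)
    If startPos = 0 Then Exit Function          ' marker absent: return ""

    startPos = startPos + Len(marker)           ' first character of the href value
    endPos = InStr(startPos, html, """")        ' closing quote of the href value
    If endPos = 0 Then Exit Function

    ExtractHrefByTitle = Mid$(html, startPos, endPos - startPos)
End Function
```

Inside the `With CreateObject("MSXML2.XMLHTTP")` block, a call such as `ExtractHrefByTitle(.responseText, "English")` would return `/en/personal.html` for the structure shown above. The approach is brittle: it stops matching if the site changes the attribute order or the spacing between attributes.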
So I need help to find the href value of the link which has "English" for its title, either via DOM search or string search.
Finally, since this will run in a loop, is there a best practice for object creation and memory use? How should the object/connection created for each iteration be closed so that memory is freed and not overloaded?
Any help would be greatly appreciated. Thanks in advance.
EDIT
An example can be found here: https://www.bnc.ca/fr/particuliers.html
EDIT to give starting code
Sub testAlias()
    'title="English" href="https://www.nbc.ca
    Dim htm As HTMLDocument, table As Object
    Set htm = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.bnc.ca/fr/particuliers.html", False
        .send
        htm.body.innerHTML = .responseText
        'Code to continue here...
        'Assuming I need to target the <a> with "English" for title and retrieve its href value
    End With
End Sub
EDIT - Code with loop not working
Based on David's answer, I created a loop that reads a value from a cell on my worksheet. I need to append that value to my domain name.
The domain is https://www.bnc.ca and here are the values I tested that don't work:
- /en/particuliers/cartes-de-credit/cartes-de-credit-mastercard/avec-plan-recompenses/allure.html
- /en/particuliers/cartes-de-credit/cartes-de-credit-mastercard/avec-plan-recompenses/or-ovation.html
I get error #91 ("Object variable or With block variable not set") on the For Each line:
For Each e In elements(0).ChildNodes
Can somebody help me resolve this, please?
Sub testAlias()
    For rid = 2 To 3
        'Dim sh As Worksheet
        'Set sh = ActiveSheet
        Dim sh As Worksheet
        Set sh = ActiveSheet
        Dim url As String
        url = "https://www.bnc.ca" & sh.Cells(rid, 1)
        'title="English" href="https://www.nbc.ca
        Dim http As MSXML2.XMLHTTP
        Dim HTMLDoc As MSHTML.HTMLDocument
        Dim DOM As Object 'MSXML2.DOMDocument
        Dim elements As Object
        Dim ele As Object
        Dim respText As String
        Set http = CreateObject("MSXML2.XMLHTTP")
        Set DOM = CreateObject("MSXML2.DOMDocument")
        Set HTMLDoc = New MSHTML.HTMLDocument 'for some reason, I can't use CreateObject to do this
        '## Create the HTTPRequest
        With http
            .Open "GET", url, False
            .send ""
            '## Load the XML to DOM
            respText = .responseText
        End With
        '## Put in HTML Document
        HTMLDoc.body.innerHTML = respText
        Debug.Print respText
        '## Parse DOM
        Set elements = HTMLDoc.getElementsByClassName("global-links")
        'If elements Is Not Nothing Then
        '## Assume there is only one class name "global-links"
        For Each e In elements(0).ChildNodes
            If e.innerText = "English" Then
                '## Display the url:
                sh.Cells(rid, 2).Formula = ""
                sh.Cells(rid, 2).Formula = e.ChildNodes(0).href
            End If
        Next
        DoEvents
        'End If
    Next rid
    Application.ScreenUpdating = True
End Sub
You can brute-force it with something like this; otherwise I would probably try XPath or a more robust use of DOM parsing (I would need to see more of the XML structure to help with that):
Sub foo()
    '## Sample markup, built with doubled quotes for the literal " characters
    Dim xmlString As String
    xmlString = "<ul class=""global-links"">" & _
        "<li><a title=""Nous joindre"" href=""/fr/coordonnees.html"">Nous Joindre</a></li>" & _
        "<li>|</li>" & _
        "<li><a title=""Carrières"" href=""/fr/carrieres.html"">Carrières</a></li>" & _
        "<li>|</li>" & _
        "<li><a title=""English"" href=""/en/personal.html"">English</a></li>" & _
        "</ul>"
    Dim DOM As Object
    Set DOM = CreateObject("MSXML2.DOMDOCUMENT")
    DOM.LoadXML xmlString
    Dim elements
    Dim e
    Set elements = DOM.DocumentElement.GetElementsByTagName("a")
    '## Ignore nodes without a grandparent instead of stopping on the error
    On Error Resume Next
    For Each e In elements
        '## Keep only anchors inside the "global-links" list...
        If e.ParentNode.ParentNode.XML Like "<ul class=""global-links"">*" Then
            '## ...whose title is "English"
            If e.XML Like "<a title=""English"" href=*" Then
                MsgBox e.XML
            End If
        End If
    Next
End Sub
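The XPath route mentioned above could look roughly like the following sketch. It assumes MSXML 6.0 is installed (the `MSXML2.DOMDocument.6.0` ProgID, where XPath is the default selection language) and that the input is well-formed XML like `xmlString`; a full HTML page will often fail to load this way. `HrefByXPath` and `fooXPath` are hypothetical names used only for illustration.

```vba
' Hypothetical helper: returns the href of the <a> with the given title inside
' <ul class="global-links">, or "" if the XML does not load or nothing matches.
' Note: linkTitle must not contain an apostrophe, since it is embedded in the XPath.
Function HrefByXPath(ByVal xmlString As String, ByVal linkTitle As String) As String
    Dim DOM As Object
    Dim node As Object

    Set DOM = CreateObject("MSXML2.DOMDocument.6.0")
    DOM.async = False
    If Not DOM.LoadXML(xmlString) Then Exit Function   ' not well-formed: return ""

    ' Select the href attribute node directly, so a missing attribute simply
    ' yields Nothing instead of a Null value.
    Set node = DOM.selectSingleNode( _
        "//ul[@class='global-links']/li/a[@title='" & linkTitle & "']/@href")

    If Not node Is Nothing Then HrefByXPath = node.Text
End Function

Sub fooXPath()
    Dim xmlString As String
    xmlString = "<ul class=""global-links"">" & _
        "<li><a title=""Nous joindre"" href=""/fr/coordonnees.html"">Nous Joindre</a></li>" & _
        "<li>|</li>" & _
        "<li><a title=""English"" href=""/en/personal.html"">English</a></li>" & _
        "</ul>"
    MsgBox HrefByXPath(xmlString, "English")
End Sub
```

Compared with the `Like` pattern matching, the XPath expression does not depend on the order of the attributes or on how the node is serialized.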
Update
I was unable to use the XML DOM (I kept getting a parse error when trying to load the HTTP responseText),
so I went back to using the HTMLDocument object instead:
Sub testAlias()
    'title="English" href="https://www.nbc.ca
    Dim HTTP As MSXML2.XMLHTTP
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim DOM As Object 'MSXML2.DOMDocument
    Dim elements As Object
    Dim e As Object
    Dim respText As String
    Set HTTP = CreateObject("MSXML2.XMLHTTP")
    Set DOM = CreateObject("MSXML2.DOMDocument")
    Set HTMLDoc = New MSHTML.HTMLDocument 'for some reason, I can't use CreateObject to do this
    '## Create the HTTPRequest
    With HTTP
        .Open "GET", "https://www.bnc.ca/fr/particuliers.html", False
        .send ""
        '## Keep the response text
        respText = .responseText
    End With
    '## Put in HTML Document
    HTMLDoc.body.innerHTML = respText
    '## I tried loading in to DOM but it would not work:
    'DOM.LoadXML respText
    'If DOM.parseError Then
    '    MsgBox DOM.parseError.reason
    '    Stop
    'End If
    '## Parse DOM
    Set elements = HTMLDoc.getElementsByClassName("global-links")
    '## Assume there is only one class name "global-links"
    For Each e In elements(0).ChildNodes
        If e.innerText = "English" Then
            '## Display the url:
            MsgBox e.ChildNodes(0).href
        End If
    Next
End Sub
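For the loop over the worksheet, a more defensive variant might look like the sketch below. It rests on a few assumptions: a reference to the Microsoft HTML Object Library is set (as in the code above), relative paths sit in column A and results go to column B, and the link is identified by its `title` rather than by its position inside `global-links`. `getElementsByClassName` returns a collection, so `elements(0)` is `Nothing` when no element has that class, which is one plausible cause of error #91. The sketch avoids that by iterating over all `<a>` elements and returning an empty string when nothing matches. `HrefByTitle` and `FillEnglishLinks` are hypothetical names.

```vba
' Hypothetical helper: returns the href of the first <a> whose title matches
' linkTitle, or "" when there is no such link.
Function HrefByTitle(ByVal html As String, ByVal linkTitle As String) As String
    Dim doc As MSHTML.HTMLDocument
    Dim a As Object
    Dim h As String

    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = html

    ' Iterating a collection that happens to be empty is safe; indexing (0) is not.
    For Each a In doc.getElementsByTagName("a")
        If StrComp(a.Title, linkTitle, vbTextCompare) = 0 Then
            h = a.href
            ' Assumption: the document has no real base URL, so relative links
            ' may come back prefixed with "about:"; strip it if present.
            If Left$(h, 6) = "about:" Then h = Mid$(h, 7)
            HrefByTitle = h
            Exit Function
        End If
    Next a
End Function

Sub FillEnglishLinks()
    Const BASE_URL As String = "https://www.bnc.ca"
    Dim sh As Worksheet
    Dim http As Object
    Dim rid As Long

    Set sh = ActiveSheet
    ' Create the request object once and reuse it for every row.
    Set http = CreateObject("MSXML2.XMLHTTP")

    For rid = 2 To 3
        http.Open "GET", BASE_URL & sh.Cells(rid, 1).Value, False
        http.send
        If http.Status = 200 Then
            sh.Cells(rid, 2).Value = HrefByTitle(http.responseText, "English")
        Else
            sh.Cells(rid, 2).Value = "HTTP " & http.Status
        End If
        DoEvents
    Next rid

    Set http = Nothing
End Sub
```

Declaring the variables and creating the XMLHTTP object outside the loop means that only one request object lives for the whole run. The HTMLDocument created inside `HrefByTitle` is released when the function returns.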