VBA parse dom to find one particular href value


Problem description

I'm using Excel VBA. All the French versions of the URLs are in an .xls file, and I have to find the roughly 400 matching URLs from the English version of the same site.

Knowing the DOM structure of the site, I know that I can:

  • Open the webpage (MSXML2.XMLHTTP)
  • Search the header of the webpage for one particular link: the one that lets the user switch language. The href of that link is the English URL, which is all I need.
  • After that, I can copy the result into the corresponding cell in my worksheet.

The structure looks like this. The "English" link changes, but it is always found under this structure:

<ul class="global-links">
<li><a title="Nous joindre" href="/fr/coordonnees.html">Nous Joindre</a></li>
<li>|</li>
<li><a title="Carrières" href="/fr/carrieres.html">Carrières</a></li>
<li>|</li>        
<li><a title="English" href="/en/personal.html">English</a></li>
</ul>

The href I want is the one on the link whose title is "English".

Where I'm stuck: I know there are two ways to find the relevant text:

  • getElement... (directly parsing the DOM)
  • inStr (string manipulation)

I managed to test both of them but:

  • Manipulating the DOM: I would have thought that the following would work, but it doesn't work at all and gives me a 438 error. Also, I don't understand whether it is possible to target the link that has "English" as its title (since there is no special class or id on it).

    .getElementsByClassName("global-links").innertext

  • So I switched to the InStr method (working out the positions so that I can do a Mid afterward). I search for:

    InStr(1, htm.body.innerHTML, "title=""English"" href=")

I can't construct the search string properly because of the double quotes. I tried doubling the double quotes, and I also tried the Chr(34) method, like so:

"title=" & Chr(34) & "English" & Chr(34) & " href=" & Chr(34)

But I can't make that work either; it doesn't find my string.

So I need help to find the href value of the link which has "English" for its title, either via DOM search or string search.

Finally, since this runs in a loop, is there a best practice for object creation and memory use? How should I close the objects/connections created on each iteration so that memory is freed and not overloaded?

Any help would be greatly appreciated. Thanks in advance.

EDIT

An example can be found here: https://www.bnc.ca/fr/particuliers.html

EDIT to give starting code

Sub testAlias()
'title="English" href="https://www.nbc.ca
Dim htm As HTMLDocument, table As Object
Set htm = New HTMLDocument

With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", "https://www.bnc.ca/fr/particuliers.html", False
    .send
    htm.body.innerHTML = .responseText

    'Code to continue here...
    'Assuming I need to target the <a> with "English" for title and retrieve its href value
End With
End Sub

EDIT - Code with loop not working

Based on David's answer, I created a loop over values read from cells on my worksheet. I need to append each value to my domain name.

The domain is https://www.bnc.ca and here are the values I tested that don't work:

  • /en/particuliers/cartes-de-credit/cartes-de-credit-mastercard/avec-plan-recompenses/allure.html
  • /en/particuliers/cartes-de-credit/cartes-de-credit-mastercard/avec-plan-recompenses/or-ovation.html

I get an error on the For Each line, reported as error #91 (undefined variable/object or With block...):

For Each e In elements(0).ChildNodes

Can somebody help me resolve that part, please?

Sub testAlias()


For rid = 2 To 3
'Dim sh As Worksheet
'Set sh = ActiveSheet



Dim sh As Worksheet
Set sh = ActiveSheet

Dim url As String
url = "https://www.bnc.ca" & sh.Cells(rid, 1)

'title="English" href="https://www.nbc.ca
Dim http As MSXML2.XMLHTTP
Dim HTMLDoc As MSHTML.HTMLDocument
Dim DOM As Object 'MSXML2.DOMDocument
Dim elements As Object
Dim ele As Object
Dim respText As String

Set http = CreateObject("MSXML2.XMLHTTP")
Set DOM = CreateObject("MSXML2.DOMDocument")
Set HTMLDoc = New MSHTML.HTMLDocument 'for some reason, I can't use CreateObject to do this

'## Create the HTTPRequest
With http
    .Open "GET", url, False
    .send ""
    '## Load the XML to DOM
    respText = .responseText

End With

'## Put in HTML Document
HTMLDoc.body.innerHTML = respText
Debug.Print respText

'## Parse DOM
Set elements = HTMLDoc.getElementsByClassName("global-links")

'If elements Is Not Nothing Then


    '## Assume there is only one class name "global-links"
    For Each e In elements(0).ChildNodes
        If e.innerText = "English" Then
            '## Display the url:
            sh.Cells(rid, 2).Formula = ""
            sh.Cells(rid, 2).Formula = e.ChildNodes(0).href
        End If
    Next
    DoEvents
'End If



Next rid
Application.ScreenUpdating = True
End Sub

Solution

You can brute-force it with something like this, otherwise I would probably try to use XPath or a more robust application of the DOM parsing (would need to see more of the XML structure to assist with that):

Sub foo()
Dim xmlString As String

xmlString = "<ul class=""global-links"">" & _
"<li><a title=""Nous joindre"" href=""/fr/coordonnees.html"">Nous Joindre</a></li>" & _
"<li>|</li>" & _
"<li><a title=""Carrières"" href=""/fr/carrieres.html"">Carrières</a></li>" & _
"<li>|</li>" & _
"<li><a title=""English"" href=""/en/personal.html"">English</a></li>" & _
"</ul>"

Dim DOM As Object
Set DOM = CreateObject("MSXML2.DOMDOCUMENT")
DOM.LoadXML xmlString


Dim elements
Dim e
Set elements = DOM.DocumentElement.GetElementsByTagName("a")

For Each e In elements
    On Error Resume Next
    If e.ParentNode.ParentNode.XML Like "<ul class=""global-links"">*" Then
        If e.XML Like "<a title=""English"" href=*" Then
            MsgBox e.XML
        End If
    End If
Next


End Sub
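
The XPath route mentioned above might look roughly like this. It is only a sketch against the same sample markup: it assumes MSXML 6.0 is available (the "MSXML2.DOMDocument.6.0" ProgID, where XPath is the default selection language), and the Sub name fooXPath is made up for illustration. Like the brute-force version, it only works when the input is well-formed XML, which a full HTML page often is not.

```vba
Sub fooXPath()
    ' Hypothetical name; same sample markup as in foo(), shortened.
    Dim xmlString As String
    xmlString = "<ul class=""global-links"">" & _
        "<li><a title=""Nous joindre"" href=""/fr/coordonnees.html"">Nous Joindre</a></li>" & _
        "<li>|</li>" & _
        "<li><a title=""English"" href=""/en/personal.html"">English</a></li>" & _
        "</ul>"

    ' Assumption: MSXML 6.0 is installed, so XPath is the default query language.
    Dim DOM As Object
    Set DOM = CreateObject("MSXML2.DOMDocument.6.0")
    DOM.async = False

    ' LoadXML returns False when the string is not well-formed XML.
    If Not DOM.LoadXML(xmlString) Then
        MsgBox DOM.parseError.reason
        Exit Sub
    End If

    ' Select the href attribute of the <a> titled "English" inside the list.
    Dim hrefNode As Object
    Set hrefNode = DOM.SelectSingleNode( _
        "//ul[@class='global-links']/li/a[@title='English']/@href")

    If hrefNode Is Nothing Then
        MsgBox "No link titled English was found"
    Else
        MsgBox hrefNode.Text
    End If
End Sub
```

Selecting the attribute node directly avoids the Like pattern matching on serialized XML, and the Nothing check covers the case where the link is missing.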

Update

I was unable to use DOM (I kept getting a parse error when trying to load the HTTP.responseText), so I went back to using the HTMLDocument object instead:

Sub testAlias()
'title="English" href="https://www.nbc.ca
Dim HTTP As MSXML2.XMLHTTP
Dim HTMLDoc As MSHTML.HTMLDocument
Dim DOM As Object 'MSXML2.DOMDocument
Dim elements As Object
Dim e As Object
Dim respText As String

Set HTTP = CreateObject("MSXML2.XMLHTTP")
Set DOM = CreateObject("MSXML2.DOMDocument")
Set HTMLDoc = New MSHTML.HTMLDocument 'for some reason, I can't use CreateObject to do this

'## Create the HTTPRequest
With HTTP
    .Open "GET", "https://www.bnc.ca/fr/particuliers.html", False
    .send ""
    '## Keep the response text
    respText = .responseText

End With

'## Put in HTML Document
HTMLDoc.body.innerHTML = respText

'## I tried loading in to DOM but it would not work:
'DOM.LoadXML respText
'If DOM.parseError Then
'    MsgBox DOM.parseError.reason
'    Stop
'End If

'## Parse DOM
Set elements = HTMLDoc.getElementsByClassName("global-links")

'## Assume there is only one class name "global-links"
For Each e In elements(0).ChildNodes
    If e.innerText = "English" Then
        '## Display the url:
        MsgBox e.ChildNodes(0).href
    End If
Next


End Sub
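
For the loop version from the question's edit, error 91 on the For Each line is consistent with elements(0) being Nothing, which is what you would get if a fetched page contains no element with the "global-links" class. That is a guess about those particular URLs, not something verified here. A hedged sketch of the loop that guards against it and creates the request object only once is below. The Sub name testAliasLoop and the constant BASE_URL are made up for illustration, and it assumes the same reference to the Microsoft HTML Object Library as the code above.

```vba
Sub testAliasLoop()
    ' Hypothetical names: testAliasLoop, BASE_URL.
    Const BASE_URL As String = "https://www.bnc.ca"

    Dim sh As Worksheet
    Dim http As Object
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim lists As Object
    Dim a As Object
    Dim rid As Long

    Set sh = ActiveSheet

    ' Create the request object once and reuse it for every row.
    Set http = CreateObject("MSXML2.XMLHTTP")

    For rid = 2 To 3
        http.Open "GET", BASE_URL & sh.Cells(rid, 1).Value, False
        http.send

        ' A fresh document per page, so nothing is left over from the previous one.
        Set HTMLDoc = New MSHTML.HTMLDocument
        HTMLDoc.body.innerHTML = http.responseText

        Set lists = HTMLDoc.getElementsByClassName("global-links")
        sh.Cells(rid, 2).Value = ""

        ' Only index the collection when it is not empty; this avoids error 91.
        If lists.Length > 0 Then
            For Each a In lists(0).getElementsByTagName("a")
                If a.Title = "English" Then
                    sh.Cells(rid, 2).Value = a.href
                    Exit For
                End If
            Next a
        End If
    Next rid

    ' Release the objects explicitly once the loop is done.
    Set HTMLDoc = Nothing
    Set http = Nothing
End Sub
```

Matching on the title attribute of each <a> rather than on innerText of the list's child nodes targets the link the question describes. Rows where nothing is found are left blank, so they are easy to spot afterward.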
