VBA href Crawl on Browser's Source Code
Question
I have updated my question, since I now understand more clearly the technical problem I am trying to solve.
A. If we take the resulting URL from a search on a data agency's site, we get this:
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000010795&type=10-K&dateb=&owner=exclude&count=20
B. Entering the URL from Step A into a browser and viewing the page source, we see at line 100 (I use Google Chrome) this line, which is also a clickable link:
href="/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm"
The line is contained in this snippet of the page source:
<tr>
<td nowrap="nowrap">10-K</td>
<td nowrap="nowrap"><a href="/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm" id="documentsbutton"> Documents</a> <a href="/cgi-bin/viewer?action=view&cik=10795&accession_number=0001193125-13-456802&xbrl_type=v" id="interactiveDataBtn"> Interactive Data</a></td>
<td class="small" >Annual report [Section 13 and 15(d), not S-K Item 405]<br />Acc-no: 0001193125-13-456802 (34 Act) Size: 15 MB </td>
<td>2013-11-27</td>
<td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&filenum=001-04802&owner=exclude&count=20">001-04802</a><br>131247478 </td>
</tr>
C. If we click the link on line 100 (the one from Step B), we go to the next page, and that link's path now becomes part of the URL. So what we get is a new page at this URL:
https://www.sec.gov/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm
D. Using the same method, at line 182 of that page's source we find this line:
href="/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"
If we click that link, we arrive at the URL stored in strXMLSite in the macro below. Once you look at the macro and run it, you will see that the string could logically be populated with the desired URL at runtime, if we could integrate a suitable procedure into the macro. That is the core of the question.
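Steps A to D come down to two string operations: pulling href values out of the page source, and prefixing root-relative ones with the site's host. The following is a minimal sketch of those two operations only, assuming the late-bound VBScript.RegExp object is available on the machine. ExtractHrefs and ToAbsoluteUrl are hypothetical helper names, not part of the original macro, and the suffix matching is deliberately simplistic.

```vba
' Hypothetical helper: returns every href value in html that ends with
' the given suffix (for example "-index.htm" or ".xml").
Function ExtractHrefs(ByVal html As String, ByVal suffix As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")   ' late bound, no reference needed
    re.Global = True
    re.IgnoreCase = True
    ' Capture the text between href=" and the closing quote.
    ' Assumption: the suffix contains no regex metacharacters other than dots.
    re.Pattern = "href=""([^""]*" & Replace(suffix, ".", "\.") & ")"""

    For Each m In re.Execute(html)
        result.Add m.SubMatches(0)
    Next m

    Set ExtractHrefs = result
End Function

' Hypothetical helper: turns a root-relative href into a full URL.
Function ToAbsoluteUrl(ByVal host As String, ByVal href As String) As String
    If Left$(href, 1) = "/" Then
        ToAbsoluteUrl = host & href    ' root-relative, as in the snippets above
    Else
        ToAbsoluteUrl = href           ' already absolute, or not handled in this sketch
    End If
End Function

Sub DemoExtract()
    Dim html As String, h As Variant
    html = "<a href=""/Archives/edgar/data/10795/000119312513456802/" & _
           "0001193125-13-456802-index.htm"" id=""documentsbutton""> Documents</a>"
    For Each h In ExtractHrefs(html, "-index.htm")
        Debug.Print ToAbsoluteUrl("https://www.sec.gov", h)
    Next h
End Sub
```

Run against the anchor from Step B, DemoExtract prints the URL shown in Step C. Calling ExtractHrefs with ".xml" on the source of that second page would be the analogous move for Step D.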
We have already enabled the Microsoft XML Core Services (MSXML) reference that the macro needs (Excel --> VBE --> Tools --> References --> Microsoft XML, v6.0).
How can we make VBA crawl from the URL in Step A, through the page source, to the URL that is now in the strXMLSite string, by adding statements to the procedure? Do we need to enable a library from Tools --> References? Can you show me a code block using such an approach? What is the line of approach here?
For completeness, here is the macro, courtesy of @user2140261:
Sub GetNode()
    Dim strXMLSite As String
    Dim objXMLHTTP As MSXML2.XMLHTTP
    Dim objXMLDoc As MSXML2.DOMDocument
    Dim objXMLNodexbrl As MSXML2.IXMLDOMNode
    Dim objXMLNodeDIIRSP As MSXML2.IXMLDOMNode

    Set objXMLHTTP = New MSXML2.XMLHTTP
    Set objXMLDoc = New MSXML2.DOMDocument

    'hard-coded URL of the XBRL instance document (the link from Step D)
    strXMLSite = "http://www.sec.gov/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"

    'fetch the document synchronously and parse the response as XML
    objXMLHTTP.Open "POST", strXMLSite, False
    objXMLHTTP.send
    objXMLDoc.LoadXML (objXMLHTTP.responseText)

    'read one element under the root and write its text to the sheet
    Set objXMLNodexbrl = objXMLDoc.SelectSingleNode("xbrl")
    Set objXMLNodeDIIRSP = objXMLNodexbrl.SelectSingleNode("us-gaap:DebtInstrumentInterestRateStatedPercentage")
    Worksheets("Sheet1").Range("A1").Value = objXMLNodeDIIRSP.Text
End Sub
Thank you for looking at my question.
Recommended answer
Add a reference to "Microsoft Internet Controls". This will get you to the point where you can get the individual XML links.
Sub Tester()
    Dim IE As New InternetExplorer
    Dim els, el, colDocLinks As New Collection
    Dim lnk

    IE.Visible = True
    Loadpage IE, "https://www.sec.gov/cgi-bin/browse-edgar?" & _
                 "action=getcompany&CIK=0000010795&type=10-K" & _
                 "&dateb=&owner=exclude&count=20"

    'collect all the "Documents" links on the page
    Set els = IE.Document.getElementsByTagName("a")
    For Each el In els
        If Trim(el.innerText) = "Documents" Then
            'Debug.Print el.innerText, el.href
            colDocLinks.Add el.href
        End If
    Next el

    'loop through the "Documents" links and check each page for xml links
    For Each lnk In colDocLinks
        Loadpage IE, CStr(lnk)
        For Each el In IE.Document.getElementsByTagName("a")
            If el.href Like "*.xml" Then
                Debug.Print el.innerText, el.href
                'work with the document from this link
            End If
        Next el
    Next lnk
End Sub

'navigate to URL and wait until the page has finished loading
Sub Loadpage(IE As Object, URL As String)
    IE.navigate URL
    Do While IE.Busy Or IE.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
End Sub
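
The comment 'work with the document from this link' marks the spot where the asker's macro would plug in. One possible way to connect the two, sketched under assumptions: GetNode is reworked so that the URL arrives as an argument instead of being hard-coded in strXMLSite. GetNodeTextFromUrl and NodeTextFromXml are hypothetical names. The sketch uses late binding via the version-independent ProgIDs, so it does not depend on which MSXML reference is ticked, and it issues a GET rather than the POST in the original macro, since nothing is being submitted. The node lookup mirrors the original and has not been checked against other filings.

```vba
' Hypothetical rework of GetNode: the URL is a parameter, not a constant.
Function GetNodeTextFromUrl(ByVal xmlUrl As String, ByVal nodeName As String) As String
    Dim http As Object

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", xmlUrl, False      ' synchronous request, as in the original
    http.send

    GetNodeTextFromUrl = NodeTextFromXml(http.responseText, nodeName)
End Function

' Parsing is split out so it can be tried on a literal string without network access.
' Returns an empty string if the XML does not parse or a node is missing.
Function NodeTextFromXml(ByVal xmlText As String, ByVal nodeName As String) As String
    Dim doc As Object, root As Object, node As Object

    Set doc = CreateObject("MSXML2.DOMDocument")
    If Not doc.LoadXML(xmlText) Then Exit Function

    Set root = doc.SelectSingleNode("xbrl")          ' same root lookup as GetNode
    If root Is Nothing Then Exit Function

    Set node = root.SelectSingleNode(nodeName)
    If Not node Is Nothing Then NodeTextFromXml = node.Text
End Function
```

Inside the inner loop of Tester, the placeholder comment could then become something like Debug.Print GetNodeTextFromUrl(el.href, "us-gaap:DebtInstrumentInterestRateStatedPercentage"). Not every .xml link on a filing page is necessarily an instance document, which is why the sketch returns an empty string instead of raising an error when a node is absent.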