VBA href Crawl on Browser's Source Code

Problem description

I have updated my question, since I now understand more clearly the technical problem I am trying to address.

A. If we take the URL that results from a search on a data agency's site, we get this:

    https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000010795&type=10-K&dateb=&owner=exclude&count=20

B. Entering the URL from step A into a browser and viewing the page source (I use Google Chrome), we see at line 100 this attribute, which is also a clickable link:

    href="/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm"

The line is contained in this snippet of the page source:

    <tr>
        <td nowrap="nowrap">10-K</td>
        <td nowrap="nowrap"><a href="/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm" id="documentsbutton">&nbsp;Documents</a>&nbsp; <a href="/cgi-bin/viewer?action=view&amp;cik=10795&amp;accession_number=0001193125-13-456802&amp;xbrl_type=v" id="interactiveDataBtn">&nbsp;Interactive Data</a></td>
        <td class="small" >Annual report [Section 13 and 15(d), not S-K Item 405]<br />Acc-no: 0001193125-13-456802&nbsp;(34 Act)&nbsp; Size: 15 MB</td>
        <td>2013-11-27</td>
        <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=001-04802&amp;owner=exclude&amp;count=20">001-04802</a><br>131247478</td>
    </tr>

C. If we click the link on line 100 (the href from step B), we go to the next page, and the href now forms the path of the URL. So what we get is a new page at this URL:

https://www.sec.gov/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm
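Step C amounts to resolving a root-relative href against the page it was found on. A minimal sketch of that step in VBA follows. `ResolveHref` is a hypothetical helper name, not part of the original macro, and it only covers hrefs that are already absolute or that begin with "/" (the two forms seen in the source above).

```vba
' Hypothetical helper: turn a root-relative href into an absolute URL.
' Assumption: href is either absolute already or starts with "/".
Function ResolveHref(ByVal pageUrl As String, ByVal href As String) As String
    Dim schemeEnd As Long, hostEnd As Long

    ' Already absolute: return unchanged
    If InStr(1, href, "://") > 0 Then
        ResolveHref = href
        Exit Function
    End If

    ' Locate "://" in the page URL; give up on cases this sketch does not cover
    schemeEnd = InStr(1, pageUrl, "://")
    If schemeEnd = 0 Or Left$(href, 1) <> "/" Then
        ResolveHref = vbNullString
        Exit Function
    End If

    ' Find the first "/" after the host name, i.e. the end of "scheme://host"
    hostEnd = InStr(schemeEnd + 3, pageUrl, "/")
    If hostEnd = 0 Then hostEnd = Len(pageUrl) + 1

    ResolveHref = Left$(pageUrl, hostEnd - 1) & href
End Function

Sub ResolveHrefDemo()
    ' Prints the step C URL built from the step A URL and the step B href
    Debug.Print ResolveHref( _
        "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000010795&type=10-K", _
        "/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm")
End Sub
```

Hrefs relative to the current directory (no leading "/") are deliberately left out; the sketch returns an empty string for them rather than guessing.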

D. Using the same method, we find this attribute at line 182 of the new page's source:

href="/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"

If we click that link, we arrive at the URL stored in strXMLSite in the macro below. Once you look at the macro and run it, you will see the logical conclusion: the string could be populated with the desired URL at runtime if we could integrate a suitable procedure into the macro. That is the nucleus of the question.

We have activated Microsoft XML Core Services (MSXML), which the macro needs (Excel --> VBE --> Tools --> References --> Microsoft XML, v6.0).
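As an aside on the reference setup: the MSXML objects can also be created by ProgID, which avoids the Tools --> References step. The sketch below assumes MSXML 6 is installed on the machine. `LateBoundMsxmlDemo` is a hypothetical name used only for illustration, and the inline XML is made up for the demo.

```vba
' Sketch: create the MSXML objects late bound instead of via Tools --> References.
' Assumption: MSXML 6 is installed, so these ProgIDs resolve.
Sub LateBoundMsxmlDemo()
    Dim http As Object   ' request object, unused here beyond creation
    Dim doc As Object    ' XML document object

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")

    ' Parse a small made-up document to confirm the parser is usable
    doc.async = False
    If doc.LoadXML("<root><item>42</item></root>") Then
        Debug.Print doc.SelectSingleNode("/root/item").Text
    End If
End Sub
```

The trade-off is the usual one for late binding: no reference to manage, but also no IntelliSense or compile-time type checking.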

How can we make VBA crawl from the URL in step A, through the page source, to the URL that is now in the strXMLSite string, by adding statements to the procedure? Do we need to activate a library under Tools --> References? Can you show me a code block that uses such an approach? What is the line of approach here?

For completeness, here is the macro, courtesy of @user2140261:

Sub GetNode()
    Dim strXMLSite As String
    Dim objXMLHTTP As MSXML2.XMLHTTP
    Dim objXMLDoc As MSXML2.DOMDocument
    Dim objXMLNodexbrl As MSXML2.IXMLDOMNode
    Dim objXMLNodeDIIRSP As MSXML2.IXMLDOMNode

    Set objXMLHTTP = New MSXML2.XMLHTTP
    Set objXMLDoc = New MSXML2.DOMDocument

    'URL of the XML document (the link found in step D)
    strXMLSite = "http://www.sec.gov/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"

    'Fetch the file synchronously; GET is sufficient for retrieving a static file
    objXMLHTTP.Open "GET", strXMLSite, False
    objXMLHTTP.send
    objXMLDoc.LoadXML objXMLHTTP.responseText

    'Select the xbrl element, then the stated interest rate element beneath it
    Set objXMLNodexbrl = objXMLDoc.SelectSingleNode("xbrl")
    Set objXMLNodeDIIRSP = objXMLNodexbrl.SelectSingleNode("us-gaap:DebtInstrumentInterestRateStatedPercentage")

    'Write the value to the worksheet
    Worksheets("Sheet1").Range("A1").Value = objXMLNodeDIIRSP.Text
End Sub

Thank you for looking at my question.

Recommended answer

Add a reference to "Microsoft Internet Controls". The code below gets you to the point where you can retrieve the individual XML links.

Sub Tester()

    Dim IE As New InternetExplorer
    Dim els, el, colDocLinks As New Collection
    Dim lnk

    IE.Visible = True
    Loadpage IE, "https://www.sec.gov/cgi-bin/browse-edgar?" & _
                  "action=getcompany&CIK=0000010795&type=10-K" & _
                  "&dateb=&owner=exclude&count=20"

    'collect all the "Document" links on the page
    Set els = IE.Document.getelementsbytagname("a")
    For Each el In els
        If Trim(el.innerText) = "Documents" Then
            'Debug.Print el.innerText, el.href
            colDocLinks.Add el.href
        End If
    Next el

    'loop through the "document" links and check each page for xml links
    For Each lnk In colDocLinks
        Loadpage IE, CStr(lnk)
        For Each el In IE.Document.getelementsbytagname("a")
            If el.href Like "*.xml" Then
                Debug.Print el.innerText, el.href
                'work with the document from this link
            End If
        Next el
    Next lnk

End Sub

Sub Loadpage(IE As Object, URL As String)
    IE.navigate URL
    Do While IE.Busy Or IE.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
End Sub
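The answer above walks the DOM that Internet Explorer builds. If only the raw page source is available as a string (for example the `responseText` of the `MSXML2.XMLHTTP` request in the question's macro), the same filtering can be approximated with a regular expression. This is a rough sketch, not a full HTML parser: `ExtractHrefs` is a hypothetical helper, it only matches double-quoted href attributes, and it does not decode entities such as `&amp;`.

```vba
' Hypothetical helper: collect every double-quoted href value in htmlText
' whose value ends with suffix (case-insensitive).
' Late-bound VBScript.RegExp, so no extra reference is needed.
Function ExtractHrefs(ByVal htmlText As String, ByVal suffix As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection
    Dim href As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = "href\s*=\s*""([^""]*)"""   ' capture group 1 = text between the quotes

    For Each m In re.Execute(htmlText)
        href = m.SubMatches(0)
        If LCase$(Right$(href, Len(suffix))) = LCase$(suffix) Then result.Add href
    Next m

    Set ExtractHrefs = result
End Function

Sub ExtractHrefsDemo()
    Dim html As String, h As Variant

    ' Two links modelled on the ones quoted in the question
    html = "<a href=""/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm"">Documents</a>" & _
           "<a href=""/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"">bdx-20130930.xml</a>"

    For Each h In ExtractHrefs(html, ".xml")
        Debug.Print h
    Next h
End Sub
```

The returned values are still relative paths, so they would need the site's scheme and host prepended before being assigned to something like strXMLSite.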
