VBA href Crawl on Browser's Source Code


Question


I have updated my question, since I now understand more clearly the technical point I am trying to address.

A. If we take the URL that results from a search on a data agency's site, we get this:

    https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000010795&type=10-K&dateb=&owner=exclude&count=20

B. Entering the URL from step A into a browser and viewing the page source, we see at line 100 (I use Google Chrome) this charming line, which is also a clickable link:

    href="/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm"

The line is contained in this snippet of the page source:

    <tr>
        <td nowrap="nowrap">10-K</td>
        <td nowrap="nowrap"><a href="/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm" id="documentsbutton">&nbsp;Documents</a>&nbsp; <a href="/cgi-bin/viewer?action=view&amp;cik=10795&amp;accession_number=0001193125-13-456802&amp;xbrl_type=v" id="interactiveDataBtn">&nbsp;Interactive Data</a></td>
        <td class="small" >Annual report [Section 13 and 15(d), not S-K Item 405]<br />Acc-no: 0001193125-13-456802&nbsp;(34 Act)&nbsp; Size: 15 MB</td>
        <td>2013-11-27</td>
        <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=001-04802&amp;owner=exclude&amp;count=20">001-04802</a><br>131247478</td>
    </tr>

C. If we click the link on line 100 (the href from step B), we go to the next page, and that href becomes part of the URL. So we get a new page at this URL:

    https://www.sec.gov/Archives/edgar/data/10795/000119312513456802/0001193125-13-456802-index.htm

D. Using the same method, we find at line 182 of that page's source this line:

    href="/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"

If we click that link, we arrive at the URL assigned to strXMLSite in the macro below. Once you look at the macro and run it, you will see the logical conclusion: the String could be populated with the desired URL at runtime if we could integrate a suitable procedure into the macro. That is the nucleus of the question.


We have activated Microsoft XML Core Services (MSXML), which the macro needs (Excel --> VBE --> Tools --> References --> Microsoft XML, v6.0).

How can we make VBA crawl from the URL in step A, through the page source, to the URL that is currently hard-coded in the strXMLSite String, by adding statements to the procedure? Do we need to activate another library under Tools --> References? Can you show me a code block that uses such an approach? What is the right line of approach here?

For completeness, here is the macro, courtesy of @user2140261:

    Sub GetNode()
        Dim strXMLSite As String
        Dim objXMLHTTP As MSXML2.XMLHTTP
        Dim objXMLDoc As MSXML2.DOMDocument
        Dim objXMLNodexbrl As MSXML2.IXMLDOMNode
        Dim objXMLNodeDIIRSP As MSXML2.IXMLDOMNode

        Set objXMLHTTP = New MSXML2.XMLHTTP
        Set objXMLDoc = New MSXML2.DOMDocument

        'hard-coded URL that the question wants to fill in at runtime
        strXMLSite = "http://www.sec.gov/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml"

        'fetch the XML synchronously and parse the response text
        objXMLHTTP.Open "POST", strXMLSite, False
        objXMLHTTP.send
        objXMLDoc.LoadXML objXMLHTTP.responseText

        'select the root xbrl node, then the element of interest beneath it
        Set objXMLNodexbrl = objXMLDoc.SelectSingleNode("xbrl")

        Set objXMLNodeDIIRSP = objXMLNodexbrl.SelectSingleNode("us-gaap:DebtInstrumentInterestRateStatedPercentage")

        'write the element's text to the worksheet
        Worksheets("Sheet1").Range("A1").Value = objXMLNodeDIIRSP.Text
    End Sub

Thank you for looking at my question.

Solution

Add a reference to "Microsoft Internet Controls". This will get you to the point where you can get the individual XML links.

    Sub Tester()

        Dim IE As New InternetExplorer
        Dim els, el, colDocLinks As New Collection
        Dim lnk

        IE.Visible = True
        Loadpage IE, "https://www.sec.gov/cgi-bin/browse-edgar?" & _
                      "action=getcompany&CIK=0000010795&type=10-K" & _
                      "&dateb=&owner=exclude&count=20"

        'collect all the "Documents" links on the page
        Set els = IE.Document.getelementsbytagname("a")
        For Each el In els
            If Trim(el.innerText) = "Documents" Then
                'Debug.Print el.innerText, el.href
                colDocLinks.Add el.href
            End If
        Next el

        'loop through the "Documents" links and check each page for xml links
        For Each lnk In colDocLinks
            Loadpage IE, CStr(lnk)
            For Each el In IE.Document.getelementsbytagname("a")
                If el.href Like "*.xml" Then
                    Debug.Print el.innerText, el.href
                    'work with the document from this link
                End If
            Next el
        Next lnk

    End Sub

    Sub Loadpage(IE As Object, URL As String)
        IE.navigate URL
        Do While IE.Busy Or IE.ReadyState <> READYSTATE_COMPLETE
            DoEvents
        Loop
    End Sub
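
To connect this back to the question's strXMLSite variable, the inner loop could be wrapped in a function that returns the .xml links instead of printing them. The following is only a sketch under assumptions: GetXmlLinks and FillFromFirstXmlLink are hypothetical names that do not appear in the original answer, the code reuses the Loadpage helper above, and it simply takes the first .xml link on the index page, which may or may not be the document you want.

```vb
'Sketch only: GetXmlLinks and FillFromFirstXmlLink are hypothetical names.
'Assumes a reference to "Microsoft Internet Controls" and the Loadpage
'helper from the answer above in the same module.

'Return every link ending in .xml found on the page at indexUrl
Function GetXmlLinks(IE As InternetExplorer, indexUrl As String) As Collection
    Dim el As Object
    Dim result As New Collection

    Loadpage IE, indexUrl
    For Each el In IE.Document.getElementsByTagName("a")
        'LCase$ makes the extension check case-insensitive
        If LCase$(el.href) Like "*.xml" Then result.Add CStr(el.href)
    Next el

    Set GetXmlLinks = result
End Function

Sub FillFromFirstXmlLink()
    Dim IE As New InternetExplorer
    Dim links As Collection
    Dim strXMLSite As String

    'index page URL taken from step C of the question
    Set links = GetXmlLinks(IE, "https://www.sec.gov/Archives/edgar/data/" & _
                "10795/000119312513456802/0001193125-13-456802-index.htm")
    IE.Quit

    If links.Count = 0 Then
        Debug.Print "No .xml link found"
        Exit Sub
    End If

    'assumption: the first .xml link is the one of interest
    strXMLSite = links(1)
    Debug.Print strXMLSite
End Sub
```

From there, strXMLSite could be handed to a variant of GetNode that takes the URL as a parameter rather than hard-coding it. If the index page lists several .xml files, some extra filtering on el.innerText or on the file name would likely be needed to pick the right one.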
