Excel VBA - Web Scraping - Get value in HTML Table cell


Question

I am trying to create a macro that scrapes a cargo tracking website. But I have to create 4 such macros, as each airline has a different website.

I am new to VBA and web scraping.

I have put together code that works for one website. But when I tried to replicate it for another one, it gets stuck in the loop. I think it may be how I am referring to the element, but as I said, I am new to VBA and have no clue about HTML.

I am trying to get the "notified" value in the highlighted line from the image.

[Image: "notified" text to be extracted]

Below is the code I have written so far that gets stuck in the loop. Any help with this would be appreciated.

Sub FlightStat_AF()

Dim url As String
Dim ie As Object
Dim nodeTable As Object

  'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
  url = "https://www.afklcargo.com/mycargo/shipment/detail/057-92366691"

  'Initialize Internet Explorer, set visibility,
  'call URL and wait until page is fully loaded
  Set ie = CreateObject("InternetExplorer.Application")
  ie.Visible = False
  ie.navigate url
  Do Until ie.readyState = 4: DoEvents: Loop
  
  'Wait to load dynamic content after IE reports it's ready
  'We can do that in a loop to match the point the information is available
  Do
    On Error Resume Next
    Set nodeTable = ie.document.getElementByClassName("block-whisper")
    On Error GoTo 0
  Loop Until Not nodeTable Is Nothing
  
  'Get the status from the table
  MsgBox Trim(nodeTable.getElementsByClassName("fs-12 body-font-bold").innerText)
  
  'Clean up
  ie.Quit
  Set ie = Nothing
  Set nodeTable = Nothing
End Sub

Recommended answer

Some basics:
For simple access like in this case, you can use the get methods of the DOM (Document Object Model). But there is an important difference between getElementByID() and getElementsByClassName() / getElementsByTagName().

getElementByID() searches for the unique ID of an HTML tag. The ID is written as the id attribute of the tag. If the page follows the HTML standard, there is only one element with this ID. That is why the method name begins with getElement, in the singular.

If the ID is not found, there is no object to work with, and VBA raises a runtime error as soon as the result is used. That is why, in my other answer, the call sits inside a loop that switches error handling off and back on again. But on the page from this question, the HTML area in question has no ID.

Instead, the required element can be accessed directly. You tried to access it with getElementsByClassName(). That is the right method. But this is where it differs from getElementByID().

getElementsByClassName() and getElementsByTagName() begin with getElements. That is plural, because a page can contain any number of elements with the same class or tag name. Both methods create an HTML node collection. All HTML elements with the requested class or tag name are listed in that collection.

All elements in the collection have an index, just like in an array. The indexes start at 0. To access a particular element, you must specify its index. The two class names fs-12 body-font-bold (class names are separated by spaces; you can also build a node collection using only one class name) deliver 2 HTML elements to the node collection. You want the second one, so you must use index 1.
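To make the indexing concrete, here is a minimal sketch of a helper that checks the collection length before picking an element. The name TextByClass and its parameters are my own and not part of the page or the original code; doc is assumed to be ie.document of a page that has already finished loading.

```vba
' Hypothetical helper (not from the original answer).
' Returns the trimmed text of the element at the zero-based index idx
' in the collection built from classNames, or "" if the collection
' holds too few elements.
Function TextByClass(doc As Object, classNames As String, idx As Long) As String
  Dim nodes As Object

  'getElementsByClassName always returns a collection, possibly empty
  Set nodes = doc.getElementsByClassName(classNames)

  'Only read the element if the index exists in the collection
  If nodes.Length > idx Then
    TextByClass = Trim(nodes.Item(idx).innerText)
  End If
End Function
```

Under these assumptions, MsgBox TextByClass(ie.document, "fs-12 body-font-bold", 1) would show the second matching element, and an empty message box if fewer than two elements were found.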

This is the VBA code for the requested page using IE:

Sub FlightStat_AF()

Dim url As String
Dim ie As Object

  'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
  url = "https://www.afklcargo.com/mycargo/shipment/detail/057-92366691"

  'Initialize Internet Explorer, set visibility,
  'call URL and wait until page is fully loaded
  Set ie = CreateObject("InternetExplorer.Application")
  ie.Visible = False
  ie.navigate url
  Do Until ie.readyState = 4: DoEvents: Loop
  
  'Wait for dynamic content to load after IE reports it's ready
  'We do that with a fixed manual pause of a few seconds
  'because the whole page gets "reloaded"
  'The three TimeSerial arguments are hours, minutes, seconds
  Application.Wait (Now + TimeSerial(0, 0, 3))
  
  'Get the status from the table
  MsgBox Trim(ie.document.getElementsByClassName("fs-12 body-font-bold")(1).innerText)
  
  'Clean up
  ie.Quit
  Set ie = Nothing
End Sub
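
If the fixed pause of 3 seconds turns out to be too short or too long, one possible alternative is to poll for the elements with an upper time limit, so the macro can never hang in an endless loop. This is only a sketch: WaitForClass and its parameters are hypothetical names, and because the page gets "reloaded" after IE reports ready, the elements may not be stable on the first hit, so the fixed pause may still be the more reliable choice for this site.

```vba
' Hypothetical helper (not from the original answer).
' Polls until at least minCount elements with the given class names
' exist, or until timeoutSec seconds have passed.
' Returns True if the elements were found, False on timeout.
Function WaitForClass(ie As Object, classNames As String, _
                      minCount As Long, timeoutSec As Long) As Boolean
  Dim nodes As Object
  Dim deadline As Date

  'A Date counts in days, so convert seconds to a fraction of a day
  deadline = Now + timeoutSec / 86400#

  Do
    DoEvents
    Set nodes = Nothing

    'The document may not be accessible while the page reloads
    On Error Resume Next
    Set nodes = ie.document.getElementsByClassName(classNames)
    On Error GoTo 0

    If Not nodes Is Nothing Then
      If nodes.Length >= minCount Then
        WaitForClass = True
        Exit Function
      End If
    End If
  Loop While Now < deadline
End Function
```

With this helper, a call such as WaitForClass(ie, "fs-12 body-font-bold", 2, 10) would wait at most about 10 seconds for the two elements before the status is read with index 1.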

Sub as function

This sub tests the function:

Sub testFunction()
  Dim flightStatAfResult As String
  flightStatAfResult = FlightStat_AF("057-92366691")
  MsgBox flightStatAfResult
End Sub

This is the sub as a function:

Function FlightStat_AF(cargoNo As String) As String

Dim url As String
Dim ie As Object
Dim result As String

  'Build the tracking URL from the cargo number passed in
  url = "https://www.afklcargo.com/mycargo/shipment/detail/" & cargoNo

  'Initialize Internet Explorer, set visibility,
  'call URL and wait until page is fully loaded
  Set ie = CreateObject("InternetExplorer.Application")
  ie.Visible = False
  ie.navigate url
  Do Until ie.readyState = 4: DoEvents: Loop
  
  'Wait for dynamic content to load after IE reports it's ready
  'We do that with a fixed manual pause of a few seconds
  'because the whole page gets "reloaded"
  'The three TimeSerial arguments are hours, minutes, seconds
  Application.Wait (Now + TimeSerial(0, 0, 3))
  
  'Get the status from the table
  result = Trim(ie.document.getElementsByClassName("fs-12 body-font-bold")(1).innerText)
  
  'Clean up
  ie.Quit
  Set ie = Nothing
  
  'Return value of the function
  FlightStat_AF = result
End Function
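
Since the function takes the cargo number as a parameter, it can be called in a loop over a list of numbers. The following is a sketch under assumptions of my own: the sub name FillStatuses and the sheet layout (cargo numbers in column A starting at row 2, results written to column B of the active sheet) are hypothetical and need to be adapted to your workbook.

```vba
' Hypothetical usage example (sheet layout is an assumption).
' Reads cargo numbers from column A and writes each status to column B.
Sub FillStatuses()
  Dim ws As Worksheet
  Dim lastRow As Long
  Dim r As Long

  Set ws = ActiveSheet

  'Find the last used row in column A
  lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row

  'Row 1 is assumed to hold headers
  For r = 2 To lastRow
    If Len(Trim(ws.Cells(r, 1).Value)) > 0 Then
      ws.Cells(r, 2).Value = FlightStat_AF(CStr(ws.Cells(r, 1).Value))
    End If
  Next r
End Sub
```

Each call opens and closes its own IE instance and waits 3 seconds, so a long list will take a while. For the other airlines, a separate function per website with its own URL and class names could be called the same way.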
