从 HTML 标签中的文件中抓取文本 [英] Scraping text from file within HTML tags

查看:31
本文介绍了从 HTML 标签中的文件中抓取文本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个要从中提取日期的文件,它是一个 HTML 源文件,因此里面充满了我不需要的代码和短语.我需要提取包含在特定 HTML 标记中的日期的每个实例:

I have a file that I want to extract dates from, it's a HTML source file so it's full of code and phrases I don't need. I need to extract every instance of a date that's wrapped in a specific HTML tag:

abbr title="((这是我需要的文字))" data-utime="

abbr title="((this is the text I need))" data-utime="

实现这一目标的最简单方法是什么?

What's the easiest way to achieve this?

推荐答案

如果您使用的是 Excel VBA,请设置对 MSHTML 库(名为 Microsoft HTML Object Library)的引用(工具 - 引用)在参考菜单中)

If you're using Excel VBA, set a reference (Tools - References) to the MSHTML library (entitled Microsoft HTML Object Library in the reference menu)

Sub ScrapeDateAbbr()

    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String

    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

因为您调用了源文件,所以我认为该文件是本地文件.如果您需要先下载它,则需要另一个对 MSXML 和此代码的引用

I assumed the file was local since you called in a source file. If you need to download it first, you'd need another reference to MSXML and this code

Sub ScrapeDateAbbrDownload()

    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement

    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send

    Do
        DoEvents
    Loop Until xHttp.readyState = 4

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

这篇关于从 HTML 标签中的文件中抓取文本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆