使用Unicode字符从本地HTML抓取表格 [英] Scraping table from local HTML with unicode characters

查看:95
本文介绍了使用Unicode字符从本地HTML抓取表格的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经尝试过以下代码从PC上存储的本地HTML文件中抓取表格

I have tried the following code to scrape a table from local HTML file stored on my PC

Sub Test()
Dim mtbl            As Object
Dim tableData       As Object
Dim tRow            As Object
Dim tcell           As Object
Dim trowNum         As Integer
Dim tcellNum        As Integer
Dim webpage         As New HTMLDocument
Dim fPath           As String
Dim strCnt          As String
Dim f               As Integer

fPath = Environ("USERPROFILE") & "\Desktop\LocalHTML.txt"
f = FreeFile()
Open fPath For Input As #f
strCnt = Input(LOF(f), f)
Close #f

webpage.body.innerHTML = strCnt

Set mtbl = webpage.getElementsByTagName("Table")(0)
Set tableData = mtbl.getElementsByTagName("tr")
Debug.Print tableData.Item(0).innerText

On Error GoTo TryAgain:
trowNum = 1

For Each tRow In tableData
    For Each tcell In tRow.Children
        tcellNum = tcellNum + 1
        Sheet1.Cells(trowNum, tcellNum) = tcell.innerText
    Next tcell
    trowNum = trowNum + 1
    tcellNum = 0
Next tRow
Exit Sub

TryAgain:
Application.Wait Now + TimeValue("00:00:02")
Err.Clear
Resume
End Sub

代码可以正常工作,但结果有两点不正确 首先,阿拉伯语字符在工作表上显示为问号.我的意思是Unicode字符无法正确读取 第二点,数据散乱地散布在工作表上

The code works with no errors but the results are incorrect in two points First the characters in Arabic appears on worksheet as questions marks. I mean the unicode characters are not read correctly Second point the data is scattered on the sheet in an unorganized structure

这是本地HTML文件的链接 http://www.mediafire.com/file/oxpyzv4gc53kuwg/LocalHTML.txt

Here's the link of the local HTML file http://www.mediafire.com/file/oxpyzv4gc53kuwg/LocalHTML.txt

感谢高级帮助

推荐答案

因此,也许这会有所帮助.这不是我想给出的完整答案.基本上,HTML是一团糟(我认为).您不会以可轻松隔离单个文本元素的方式将数据布置在行(tr)和表单元格(td)内.

So, maybe this will help a little. It is not the complete answer I would like to give. Basically, the HTML is a mess (in my opinion). You don't have data laid out in rows (tr), with table cells (td) within, in a manner that you can use to easily isolate individual text elements.

我提供以下内容实际上只是为了说明尝试隔离单个文本组件并使用保留的阿拉伯字符进行读取/写入的怪异之处.我从 @whom 借用了adodb流方法,以确保UTF-8.

I am offering the following really only to demonstrate the oddities of trying to isolate individual text components and to read/write with arabic characters preserved. I borrowed an adodb stream method from @whom to ensure UTF-8.

这种用硬编码编号循环table标签等的方法很丑陋,实际上属于sin bin.我使用这样一个事实,即以后的表将您的各个组件分别存储,以重构具有行和列的整体表外观.

This method, looping table tags etc with hardcoded numbering, is ugly and really belongs in the sin bin. I use the fact that later tables have your individual components stored individually to reconstruct an overall table appearance with rows and columns.

但是您可能会从中得到一些东西

But you may get something from it:

Option Explicit

Public Sub test()
    Dim fStream  As ADODB.Stream, html As HTMLDocument
    Set html = New HTMLDocument
    Set fStream = New ADODB.Stream
    With fStream
        .Charset = "UTF-8"
        .Open
        .LoadFromFile "C:\Users\User\Downloads\LocalHTML.html"
        html.body.innerHTML = .ReadText
        .Close
    End With

    Dim hTables As Object, startTableNumber As Long, i As Long, r As Long, c As Long
    Dim counter As Long, endTableNumber, numColumns As Long

    startTableNumber = 43
    endTableNumber = 330
    numColumns = 9

    Set hTables = html.getElementsByTagName("table")
    r = 2: c = 1

    For i = startTableNumber To endTableNumber Step 2
        counter = counter + 1
        If counter = 10 Then
            c = 1: r = r + 1: counter = 1
        End If
        Cells(r, c) = hTables(i).innerText
        c = c + 1
    Next

End Sub

这篇关于使用Unicode字符从本地HTML抓取表格的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆