html parsing of cricinfo scorecards


Question


Aim

I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel.

As an example the current Australian Big Bash 2011/12 scorecards are available from

Background

I am proficient in using VBA (either automating IE or using XMLHTTP and then using regular expressions) to scrape data from websites, as in the earlier question "Extract values from HTML TD and Tr".

In that same question a comment was posted suggesting HTML parsing, which I hadn't come across before, so I have taken a look at questions such as "RegEx match open tags except XHTML self-contained tags".

Query

While I could write a regex to parse the cricket data below, I would like advice on how I could efficiently retrieve these results with HTML parsing.

Please bear in mind that my preference is a repeatable CSV format containing:

  • the date/name of the match
  • Team 1 name
  • the output should dump up to 11 records for Team 1 (blank records where players haven't batted, i.e. "Did Not Bat")
  • Team 2 name
  • the output should dump up to 11 records for Team 2 (blank records where players haven't batted)

Nirvana for me would be a solution that I could deploy using VBA or VBScript so that I could fully automate my analysis, but I presume I will have to use a separate tool for the HTML parsing.

Sample Site links and Data to be Extracted

Solution

There are two techniques that I use with VBA. I will describe them one by one.

1) Using FireFox / Firebug Addon / Fiddler

2) Using Excel's inbuilt facility to get data from the web

Since this post will be read by many, I will cover even the obvious. Please feel free to skip any part that you already know.


1) Using FireFox / Firebug Addon / Fiddler


FireFox : http://en.wikipedia.org/wiki/Firefox Free download (http://www.mozilla.org/en-US/firefox/new/)

Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29 Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)

Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29 Free download (http://www.fiddler2.com/fiddler2/)

Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.

Another example would be finding the name of the table on a website that holds the data you need scraped.

I use Fiddler only when I am using XMLHTTP. It lets me see the exact information being passed when you click on a button. Because of the increase in the number of bots that scrape sites, most sites now try to prevent automated scraping by capturing your mouse coordinates and passing that information along, and Fiddler helps you debug what is being sent. I will not go into much detail about it here, as this information can be used maliciously.

Now let's take a simple example of how to scrape the URL posted in your question

http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html

First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.

So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file, then we can work with the data to do our analysis. Here is sample code which will dump that table into Sheet1.

Before we proceed, I would recommend closing all Excel instances and starting a fresh one.

Launch the VBA editor and insert a Userform. Place a command button and a WebBrowser control on it. Your Userform might look like this

Paste this code in the Userform code area

Option Explicit

'~~> Set Reference to Microsoft HTML Object Library

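'~~> PtrSafe is required for API declarations in 64-bit Office (VBA7)
#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If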

Private Sub CommandButton1_Click()
    Dim URL As String
    Dim oSheet As Worksheet

    Set oSheet = Sheets("Sheet1")

    URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"

    PopulateDataSheets oSheet, URL

    MsgBox "Data scraped. Please check " & oSheet.Name
End Sub

Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim insertRow As Long, Row As Long, col As Long

    On Error GoTo whoa

    WebBrowser1.navigate URL

    WaitForWBReady

    Set tbl = WebBrowser1.Document.getElementById("inningsBat1")

    With wsk
        .Cells.Clear

        insertRow = 0
        For Row = 0 To tbl.Rows.Length - 1
            Set tr = tbl.Rows(Row)
            If Trim(tr.innerText) <> "" Then
                If tr.Cells.Length > 2 Then
                    If tr.Cells(1).innerText <> "Total" Then
                        insertRow = insertRow + 1
                        For col = 0 To tr.Cells.Length - 1
                            .Cells(insertRow, col + 1) = tr.Cells(col).innerText
                        Next
                    End If
                End If
            End If
        Next
    End With
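whoa:
    '~~> Report any error (e.g. table not found) instead of failing silently
    If Err.Number <> 0 Then MsgBox "Error " & Err.Number & ": " & Err.Description
    Unload Me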
End Sub

Private Sub Wait(ByVal nSec As Long)
    nSec = nSec + Timer
    While Timer < nSec
        DoEvents
        Sleep 100
    Wend
End Sub

Private Sub WaitForWBReady()
    Wait 1
    While WebBrowser1.ReadyState <> 4
        Wait 3
    Wend
End Sub

Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot

Similarly you can scrape other info as well.
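The question also mentions XMLHTTP. As a rough sketch only, and not part of the tested code above, the same table could be read without a Userform or WebBrowser control by fetching the page with XMLHTTP and loading the response into a late-bound HTML document. This assumes that the table is present in the raw HTML returned by the server (not built later by script) and that it still carries the id "inningsBat1". The procedure name ScrapeWithXmlHttp is made up for illustration.

```vba
Option Explicit

'~~> Sketch only: late binding, so no library reference is required.
'~~> Assumes the table is in the raw HTML response and has the id "inningsBat1".
Public Sub ScrapeWithXmlHttp()
    Dim http As Object, doc As Object, tbl As Object, tr As Object
    Dim r As Long, c As Long, insertRow As Long
    Const URL As String = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"

    '~~> Fetch the page (False = synchronous request)
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", URL, False
    http.send

    If http.Status <> 200 Then
        MsgBox "Request failed with HTTP status " & http.Status
        Exit Sub
    End If

    '~~> Parse the response with the HTML parser instead of a regex
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = http.responseText

    Set tbl = doc.getElementById("inningsBat1")
    If tbl Is Nothing Then
        MsgBox "Table inningsBat1 not found"
        Exit Sub
    End If

    '~~> Copy every row with more than 2 cells into Sheet1
    With Sheets("Sheet1")
        .Cells.Clear
        For r = 0 To tbl.Rows.Length - 1
            Set tr = tbl.Rows(r)
            If tr.Cells.Length > 2 Then
                insertRow = insertRow + 1
                For c = 0 To tr.Cells.Length - 1
                    .Cells(insertRow, c + 1) = tr.Cells(c).innerText
                Next c
            End If
        Next r
    End With
End Sub
```

Unlike the Userform code, this sketch does not skip the "Total" row. Add the same check if you need it.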


2) Using Excel's inbuilt facility to get data from the web


I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.

Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.

Enter the URL in the "New Web Query" window and click on "Go".

Once the page has loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import".

Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.

If you wish you can record a macro and automate this as well :)

Here is the macro that I recorded.

Sub Macro1()
    With ActiveSheet.QueryTables.Add(Connection:= _
    "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
    , Destination:=Range("$A$1"))
        .Name = "524915"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlSpecifiedTables
        .WebFormatting = xlWebFormattingNone
        .WebTables = """inningsBat1"""
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub
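Since the question asks for CSV output, here is a minimal sketch of one way to save the imported sheet as a CSV file once either method has filled it. The procedure name ExportSheetToCsv and the output path in the usage line are illustrative assumptions, not something from the recorded macro.

```vba
'~~> Sketch only: saves a copy of one worksheet as a CSV file.
'~~> The original workbook is left untouched.
Public Sub ExportSheetToCsv(ws As Worksheet, csvPath As String)
    Dim wbTemp As Workbook

    ws.Copy                            ' copies the sheet into a new workbook
    Set wbTemp = ActiveWorkbook        ' the new workbook becomes the active one

    Application.DisplayAlerts = False  ' suppress the overwrite prompt
    wbTemp.SaveAs Filename:=csvPath, FileFormat:=xlCSV
    wbTemp.Close SaveChanges:=False
    Application.DisplayAlerts = True
End Sub
```

It could be called as ExportSheetToCsv Sheets("Sheet2"), "C:\Temp\524915.csv", where the folder is a placeholder that must already exist.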


Hope this helps. Let me know if you still have some queries.

Sid
