PowerShell: Extracting HTML table as CSV

Question

I am trying to extract an HTML table to a CSV file. I do not know much about PowerShell. I found some examples online, but I always get the same error message:

    You cannot call a method on a null-valued expression.
    At line:8 char:1
    + $table = $oHTML.ParsedHtml.body.getElementsByTagName('table')[0]
        + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
        + FullyQualifiedErrorId : InvokeMethodOnNull
This is what I have at the moment, but I am a bit stuck.

    $url = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
    $webClient = New-Object System.Net.Webclient
    $webClient.DownloadString($url) | Out-File -FilePath C:\Users\USER\Downloads\DUMP\dump.html
    
    $oHTML = Get-Content C:\Users\USER\Downloads\DUMP\dump.html -Raw
    
    #Just grabbing first table for my testing
    $table = $oHTML.ParsedHtml.body.getElementsByTagName('Table')[0]
    
    $Headers = ($table.Rows[0].Cells | Select -ExpandProperty innerText).trim()
    $psCollection=@()
    
    $dataRows = $table.Rows | Select -Skip 1
    foreach ($tablerow in $dataRows) {
        $cells = ($tablerow.Cells | Select -ExpandProperty innerText).trim()
        $obj = New-Object -TypeName PSObject
        $count = 0;
        foreach ($cell in $cells) {
            if ($count -lt $Headers.length) {
                $obj | Add-Member -MemberType NoteProperty -Name $Headers[$count++] -Value $cell
            }
        }
        $psCollection+=$obj
    }
    
    $psCollection | Select 'MyField' -Unique
    

Recommended answer
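
The error in the question most likely arises because `Get-Content -Raw` returns a plain string. `ParsedHtml` is a property of the response object that `Invoke-WebRequest` returns in Windows PowerShell, not of a string, so `$oHTML.ParsedHtml` is `$null` and the method call further down the chain fails. A minimal check of that behaviour, using a made-up HTML snippet in place of the downloaded file:

```powershell
# Made-up HTML standing in for the content of dump.html (an assumption for illustration).
$oHTML = '<html><body><table><tr><th>A</th></tr></table></body></html>'

# Get-Content -Raw yields the same type as this literal: a System.String.
$typeName = $oHTML.GetType().FullName

# A string has no ParsedHtml property, so outside strict mode the lookup
# evaluates to $null, and calling a method on that result raises the error.
$hasParsedHtml = $null -ne $oHTML.ParsedHtml

"$typeName / ParsedHtml present: $hasParsedHtml"
```

This is why the HTML text has to be loaded into a document object before it can be queried for tables.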

Below is a function that uses the IHTMLDocument2 interface:

    Function Read-HtmlTable {
        [CmdletBinding(DefaultParameterSetName = 'HtmlSet')][OutputType([Object[]])] param(
            [Parameter(ParameterSetName='HtmlSet', ValueFromPipeLine = $True, Mandatory = $True)][String]$Html,
            [Parameter(ParameterSetName='UriSet',  ValueFromPipeLine = $True, Mandatory = $True)][Uri]$Uri,
            [Int[]]$TableIndex,
            [Int]$RowIndex
        )
        Begin {
            function Get-TopElements {
                [CmdletBinding()][OutputType([__ComObject[]])] param(
                    [Parameter(Mandatory = $True)][String]$TagName,
                    [Parameter(Mandatory = $True, ValueFromPipeLine = $True)]$Element
                )
                if ($Element.tagName -eq $TagName) { $Element }
                else { $Element.Children | Foreach-Object { $_ | Get-TopElements $TagName } } 
            }
        }
        Process {
            if (!$Uri -and $Html.Length -le 2048 -and ([Uri]$Html).AbsoluteUri) { $Uri = [Uri]$Html }
            if ($Uri.AbsoluteUri) { $Html = [System.Net.Webclient]::New().DownloadString($Uri) }
            $Unicode = [System.Text.Encoding]::Unicode.GetBytes($Html)
            $Document = New-Object -Com 'HTMLFile'
            if ($Document.IHTMLDocument2_Write) { $Document.IHTMLDocument2_Write($Unicode) } else { $Document.write($Unicode) }
            $Index = 0 
            foreach($Table in ($Document.Body | Get-TopElements 'table')) {
                if (!$PSBoundParameters.ContainsKey('TableIndex') -or $Index++ -In $TableIndex) {
                    $Names = $Null
                    $THead = $Table | Get-TopElements 'thead'
                    $TBody = $Table | Get-TopElements 'tbody'
                    $TableHead = if ($THead) { $THead } else { $Table }
                    $TableBody = if ($TBody) { $TBody } else { $Table }
                    $HeaderRows = $TableHead | Get-TopElements 'tr'
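                    if ($PSBoundParameters.ContainsKey('RowIndex')) {
                        # Assumption: -RowIndex selects the header row; its cell texts become the column names.
                        $HeaderRow = @($HeaderRows)[$RowIndex]
                        $Names = @($HeaderRow.Children | ForEach-Object { $_.innerText })
                    } else {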
                        foreach ($HeaderRow in $HeaderRows) { 
                            $TH = $HeaderRow | Get-TopElements 'th'
                            if (!$Names -or $TH -and $TH.Count -gt $Names.Count) { $Names = @($TH.innerText) }
                            elseif ($Names -and $TH) {break }
                            if (!$TH -or !$Names -or $Names[0].TagName -ne 'th') {
                                $TD = $HeaderRow | Get-TopElements 'td'
                                if (!$Names -or $TD -and $TD.Count -gt $Names.Count) {$Names = @($TD.innerText) }
                                elseif ($Names -and $TD) { break }
                            }
                        }
                    }
                    foreach ($TableRow in ($TableBody | Get-TopElements 'tr')) {
                        if ($THead -or $TBody -or $TableRow.rowIndex -ge $HeaderRow.RowIndex) {
                            $Values = @(($TableRow | Get-TopElements 'td').innerText)
                            $Properties = [Ordered]@{}
                            $Count = [Math]::Min($Names.Count, $Values.Count)
                            for ($i = 0; $i -lt $Count; $i++) { $Properties[$Names[$i]] = $Values[$i] }
                            if ($Properties.Count -gt 0) { [pscustomobject]$Properties }
                        }
                    }
                }
            }
        }
    }
    

Usage:

    $url = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
    $webClient = New-Object System.Net.Webclient
    $HTML = $webClient.DownloadString($url)
    
    $HTML | Read-HtmlTable -TableIndex 0 | Format-Table
    

Result:

    Version Servicing option               Availability date OS build   Latest revision date End of service: Home, Pro, Pro Education, Pro for Workstations and IoT Core End of service: Enterprise, Education and IoT Enterprise
    ------- ----------------               ----------------- --------   -------------------- --------------------------------------------------------------------------- --------------------------------------------------------
    20H2    Semi-Annual Channel            2020-10-20        19042.928  2021-04-13           2022-05-10                                                                  2023-05-09
    2004    Semi-Annual Channel            2020-05-27        19041.928  2021-04-13           2021-12-14                                                                  2021-12-14
    1909    Semi-Annual Channel            2019-11-12        18363.1500 2021-04-13           2021-05-11                                                                  2022-05-10
    1809    Semi-Annual Channel            2019-03-28        17763.1879 2021-04-13           End of service                                                              2021-05-11
    1809    Semi-Annual Channel (Targeted) 2018-11-13        17763.1879 2021-04-13           End of service                                                              2021-05-11
    1803    Semi-Annual Channel            2018-07-10        17134.2145 2021-04-13           End of service                                                              2021-05-11
    1803    Semi-Annual Channel (Targeted) 2018-04-30        17134.2145 2021-04-13           End of service                                                              2021-05-11
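
The question asked for a CSV file, and the objects the function emits can be piped to `Export-Csv` instead of `Format-Table`. The sketch below uses a few hand-built stand-in rows (values copied from the result above) so that it runs without the function or network access; the output path is a hypothetical choice.

```powershell
# Stand-in rows shaped like the objects Read-HtmlTable emits (subset of columns).
$rows = @(
    [pscustomobject]@{ Version = '20H2'; 'Availability date' = '2020-10-20'; 'OS build' = '19042.928' }
    [pscustomobject]@{ Version = '2004'; 'Availability date' = '2020-05-27'; 'OS build' = '19041.928' }
)

# Hypothetical output path; any writable location works.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'windows-releases.csv'

# -NoTypeInformation suppresses the "#TYPE" line that Windows PowerShell 5.1 writes by default.
$rows | Export-Csv -Path $csvPath -NoTypeInformation -Encoding UTF8

# Read the file back to confirm the round trip.
$check = @(Import-Csv -Path $csvPath)
```

With the function loaded, the same pattern would presumably be `$HTML | Read-HtmlTable -TableIndex 0 | Export-Csv -Path $csvPath -NoTypeInformation`.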
    
