PowerShell: Extracting HTML table as CSV
Problem description
I am trying to extract an HTML table to a CSV file. I don't know much about PowerShell, and although I've found some examples online, I always get the same error message:
You cannot call a method on a null-valued expression.
At line:8 char:1
+ $table = $oHTML.ParsedHtml.body.getElementsByTagName('table')[0]
This is what I have at the moment, but I'm a bit stuck.
$url = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
$webClient = New-Object System.Net.Webclient
$webClient.DownloadString($url) | Out-File -FilePath C:\Users\USER\Downloads\DUMP\dump.html
$oHTML = Get-Content C:\Users\USER\Downloads\DUMP\dump.html -Raw
#Just grabbing first table for my testing
$table = $oHTML.ParsedHtml.body.getElementsByTagName('Table')[0]
$Headers = ($table.Rows[0].Cells | Select -ExpandProperty innerText).trim()
$psCollection=@()
$dataRows = $table.Rows | Select -Skip 1
foreach ($tablerow in $dataRows) {
    $cells = ($tablerow.Cells | Select -ExpandProperty innerText).trim()
    $obj = New-Object -TypeName PSObject
    $count = 0;
    foreach ($cell in $cells) {
        if ($count -lt $Headers.length) {
            $obj | Add-Member -MemberType NoteProperty -Name $Headers[$count++] -Value $cell
        }
    }
    $psCollection+=$obj
}
$psCollection | Select 'MyField' -Unique
Recommended answer
Here is a function that uses the IHTMLDocument2 interface:
Function Read-HtmlTable {
    [CmdletBinding(DefaultParameterSetName = 'HtmlSet')][OutputType([Object[]])] param(
        [Parameter(ParameterSetName='HtmlSet', ValueFromPipeLine = $True, Mandatory = $True)][String]$Html,
        [Parameter(ParameterSetName='UriSet', ValueFromPipeLine = $True, Mandatory = $True)][Uri]$Uri,
        [Int[]]$TableIndex,
        [Int]$RowIndex
    )
    Begin {
        # Returns the outermost descendants of $Element that have the given tag name
        function Get-TopElements {
            [CmdletBinding()][OutputType([__ComObject[]])] param(
                [Parameter(Mandatory = $True)][String]$TagName,
                [Parameter(Mandatory = $True, ValueFromPipeLine = $True)]$Element
            )
            if ($Element.tagName -eq $TagName) { $Element }
            else { $Element.Children | Foreach-Object { $_ | Get-TopElements $TagName } }
        }
    }
    Process {
        # A short input string that parses as an absolute URI is treated as a URI
        if (!$Uri -and $Html.Length -le 2048 -and ([Uri]$Html).AbsoluteUri) { $Uri = [Uri]$Html }
        if ($Uri.AbsoluteUri) { $Html = [System.Net.Webclient]::New().DownloadString($Uri) }
        # Parse the markup with the HTMLFile COM object (Windows only)
        $Unicode = [System.Text.Encoding]::Unicode.GetBytes($Html)
        $Document = New-Object -Com 'HTMLFile'
        if ($Document.IHTMLDocument2_Write) { $Document.IHTMLDocument2_Write($Unicode) } else { $Document.write($Unicode) }
        $Index = 0
        foreach($Table in ($Document.Body | Get-TopElements 'table')) {
            if (!$PSBoundParameters.ContainsKey('TableIndex') -or $Index++ -In $TableIndex) {
                $Names = $Null
                $THead = $Table | Get-TopElements 'thead'
                $TBody = $Table | Get-TopElements 'tbody'
                $TableHead = if ($THead) { $THead } else { $Table }
                $TableBody = if ($TBody) { $TBody } else { $Table }
                $HeaderRows = $TableHead | Get-TopElements 'tr'
                if ($PSBoundParameters.ContainsKey('RowIndex')) {
                    # Assumption: -RowIndex selects the header row explicitly
                    $HeaderRow = @($HeaderRows)[$RowIndex]
                    $Names = @(($HeaderRow | Get-TopElements 'th').innerText)
                    if (!$Names) { $Names = @(($HeaderRow | Get-TopElements 'td').innerText) }
                }
                else {
                    # Otherwise take the widest leading row of <th> (or else <td>) cells as the header
                    foreach ($HeaderRow in $HeaderRows) {
                        $TH = $HeaderRow | Get-TopElements 'th'
                        if (!$Names -or $TH -and $TH.Count -gt $Names.Count) { $Names = @($TH.innerText) }
                        elseif ($Names -and $TH) { break }
                        if (!$TH -or !$Names -or $Names[0].TagName -ne 'th') {
                            $TD = $HeaderRow | Get-TopElements 'td'
                            if (!$Names -or $TD -and $TD.Count -gt $Names.Count) { $Names = @($TD.innerText) }
                            elseif ($Names -and $TD) { break }
                        }
                    }
                }
                # Emit one object per data row, pairing header names with cell values
                foreach ($TableRow in ($TableBody | Get-TopElements 'tr')) {
                    if ($THead -or $TBody -or $TableRow.rowIndex -ge $HeaderRow.RowIndex) {
                        $Values = @(($TableRow | Get-TopElements 'td').innerText)
                        $Properties = [Ordered]@{}
                        $Count = [Math]::Min($Names.Count, $Values.Count)
                        for ($i = 0; $i -lt $Count; $i++) { $Properties[$Names[$i]] = $Values[$i] }
                        if ($Properties.Count -gt 0) { [pscustomobject]$Properties }
                    }
                }
            }
        }
    }
}
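For context on the error in the question: Get-Content -Raw returns a plain string, and a string has no ParsedHtml property. As far as I know, ParsedHtml exists only on the response object that Invoke-WebRequest returns in Windows PowerShell 5.1. The sketch below reproduces the failure with a hypothetical inline HTML string in place of the downloaded file, and assumes strict mode is off (the default).

```powershell
# Hypothetical stand-in for the content read back with Get-Content -Raw
$oHTML = '<html><body><table><tr><td>1</td></tr></table></body></html>'

# The value is just a string ...
$typeName = $oHTML.GetType().FullName     # System.String

# ... so the ParsedHtml lookup silently yields $null outside strict mode
$isNull = $null -eq $oHTML.ParsedHtml     # True

# Calling a method on that $null raises the error from the question
$message = $null
try {
    $oHTML.ParsedHtml.body.getElementsByTagName('table')
}
catch {
    $message = $_.Exception.Message
}
$message
```

This is why the function above parses the markup itself instead of relying on ParsedHtml.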
Usage:
$url = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
$webClient = New-Object System.Net.Webclient
$HTML = $webClient.DownloadString($url)
$HTML | Read-HtmlTable -TableIndex 0 | Format-Table
Result:
Version Servicing option Availability date OS build Latest revision date End of service: Home, Pro, Pro Education, Pro for Workstations and IoT Core End of service: Enterprise, Education and IoT Enterprise
------- ---------------- ----------------- -------- -------------------- --------------------------------------------------------------------------- --------------------------------------------------------
20H2 Semi-Annual Channel 2020-10-20 19042.928 2021-04-13 2022-05-10 2023-05-09
2004 Semi-Annual Channel 2020-05-27 19041.928 2021-04-13 2021-12-14 2021-12-14
1909 Semi-Annual Channel 2019-11-12 18363.1500 2021-04-13 2021-05-11 2022-05-10
1809 Semi-Annual Channel 2019-03-28 17763.1879 2021-04-13 End of service 2021-05-11
1809 Semi-Annual Channel (Targeted) 2018-11-13 17763.1879 2021-04-13 End of service 2021-05-11
1803 Semi-Annual Channel 2018-07-10 17134.2145 2021-04-13 End of service 2021-05-11
1803 Semi-Annual Channel (Targeted) 2018-04-30 17134.2145 2021-04-13 End of service 2021-05-11
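Since the goal was a CSV file, the objects can be piped to Export-Csv instead of Format-Table. The following is a minimal sketch rather than part of the original answer. It uses a few stand-in rows with a subset of the columns shown above so that it runs without the COM parser. The file name and temp-folder location are hypothetical.

```powershell
# Stand-in rows shaped like the Read-HtmlTable output (values copied from the result above)
$rows = @(
    [pscustomobject]@{ Version = '20H2'; 'Servicing option' = 'Semi-Annual Channel'; 'OS build' = '19042.928' }
    [pscustomobject]@{ Version = '2004'; 'Servicing option' = 'Semi-Annual Channel'; 'OS build' = '19041.928' }
)

# Hypothetical output path in the temp folder; adjust as needed
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'windows-releases.csv'

# -NoTypeInformation suppresses the "#TYPE" line that Windows PowerShell 5.1 writes by default
$rows | Export-Csv -Path $csvPath -NoTypeInformation -Encoding UTF8

# Read the file back to confirm the round trip
$check = @(Import-Csv -Path $csvPath)
$check | Format-Table
```

With the real function, the equivalent call would be $HTML | Read-HtmlTable -TableIndex 0 | Export-Csv -Path $csvPath -NoTypeInformation.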