How to convert an HTML table to a CSV file with the same structure using PowerShell
Question
With PowerShell, I can get the table with this:
$URL = "http://example.com/yyy.htm"
$OutputFile = "$env:temp\tempfile.xml"
# reading website data:
$data = Invoke-WebRequest -Uri $URL
# get the first table found on the website and write it to disk:
@($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | Set-Content -Path $OutputFile
Now I want to convert this table to CSV. How do I do that?
Example table:
Datacenter | FirstDNS | SecondDNS | ThirdDNS | FourthDNS
-----------------------------------------------------------
NewYork    | 1.1.1.1  | 2.2.2.2   | 3.3.3.3  | 4.4.4.4
India      | 1.2.3.4  | 3.2.6.5   | 8.2.3.7  | 8.3.66.1
Recommended answer
Here's a solution that converts HTML tables to PSObjects, which you can then pipe to Export-Csv or process however you need.
Please note: this is not a clean solution. It just about does the job for simple scenarios, but it has a lot of issues:
- Can't cope with special characters other than &nbsp;. To get others to work, you'll need to add new definitions to the DOCTYPE's entity map as required.
- Can't cope with colspan or rowspan; it assumes every row has the same number of columns as the header. (There's a tweak to prevent errors if there are more columns than headers, but you may still get misalignment in that scenario.)
- My technique for cleaning the HTML before converting it to XML uses a regex rather than a parsing library, so there could well be unexpected issues there.
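As a rough illustration of the first limitation, the entity map is the inline <!ENTITY nbsp ' '> declaration inside the DOCTYPE string that ConvertFrom-HtmlTable builds. The sketch below shows one way such a map could be extended. The $entities table, the chosen entities, and the sample row are assumptions made up for this example, not part of the original answer.

```powershell
# Hypothetical entity map: add whichever named entities your pages actually use.
$entities = [ordered]@{
    nbsp = '&#160;'
    copy = '&#169;'
    euro = '&#8364;'
}

# Build one <!ENTITY name 'value'> declaration per entry.
$declarations = ($entities.GetEnumerator() | ForEach-Object {
    "<!ENTITY {0} '{1}'>" -f $_.Key, $_.Value
}) -join ''
$docType = "<!DOCTYPE doctypeName [$declarations]>"

# Made-up sample row containing all three entities.
[xml]$doc = "$docType<root><tr><td>5&nbsp;&euro;</td><td>&copy; 2020</td></tr></root>"

# InnerText returns the cell text with the entities expanded.
$cellText = @($doc.SelectNodes('//td') | ForEach-Object { $_.InnerText })
$cellText
```

Without a matching declaration, the [xml] cast fails on the first undeclared entity, which is why each one has to be listed.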
function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader
    )
    process {
        # @() keeps $cols an array even when the row has a single cell
        $cols = @($htmlTableRow | Select-Object -ExpandProperty td)
        if ($isHeader.IsPresent) {
            # clean the headers to ensure each column has a name
            0..($cols.Count - 1) | ForEach-Object {
                # Out-String appends a newline, so trim before using the text as a name
                $x = ($cols[$_] | Out-String).Trim()
                if ($x) { $x } else { 'Column_{0:0000}' -f $_ }
            }
        } else {
            $colCount = $cols.Count - 1
            $result = New-Object -TypeName PSObject
            0..$colCount | ForEach-Object {
                # fall back to a generated name in case we have more columns than headers
                $colName = if ($headers[$_]) { $headers[$_] } else { 'Column_{0:00000}' -f $_ }
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            }
            Write-Output $result
        }
    }
}
function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        # currently only a very basic <table><tr><td>...</td></tr></table> structure is supported
        # could be improved to better understand tbody, th, nested tables, etc.
        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow
        # strip attributes from tr/td/th and remove every other tag (simplifies parsing), then turn th into td;
        # the \b keeps <thead> from being mistaken for a <th> tag
        $rowsHtml = $htmlTable | Select-Object -ExpandProperty innerHTML | ForEach-Object {
            (($_ | Out-String) -replace '(</?t[rdh])\b[^>]*(/?>)|(?:<[^>]*>)', '$1$2') -replace '(</?)(?:th)([^>]*/?>)', '$1td$2'
        }
        # declare &nbsp; so the XML parser accepts it; add further entities here as required
        [xml]$cleanedHtml = "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f $rowsHtml
        # the first row supplies the column names
        [string[]]$headers = $cleanedHtml.root.tr | Select-Object -First 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | Select-Object -Skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | Select-Object $headers
        }
    }
}
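To make the regex cleanup in ConvertFrom-HtmlTable easier to follow, here is a small standalone sketch of the two -replace passes applied to a made-up fragment. The $html sample is hypothetical, and the \b word boundary in the first pattern is an assumption of this sketch: it is there so that <thead> is dropped by the second alternative instead of being matched by t[rdh] and rewritten to <th>.

```powershell
# Made-up table fragment with attributes, thead/tbody wrappers and an inline <b> tag.
$html = '<thead><tr class="x"><th scope="col">Name</th><th>IP</th></tr></thead>' +
        '<tbody><tr><td><b>NewYork</b></td><td>1.1.1.1</td></tr></tbody>'

# Pass 1: keep tr/td/th tags but strip their attributes; every other tag
# matches the second alternative, where $1 and $2 are empty, so it is removed.
$step1 = $html -replace '(</?t[rdh])\b[^>]*(/?>)|(?:<[^>]*>)', '$1$2'

# Pass 2: rename th to td so header and data rows share one shape.
$step2 = $step1 -replace '(</?)(?:th)([^>]*/?>)', '$1td$2'

$step2
# <tr><td>Name</td><td>IP</td></tr><tr><td>NewYork</td><td>1.1.1.1</td></tr>
```

What is left is plain tr/td markup that can be wrapped in a root element and cast to [xml]. Self-closing cells such as <td/> would still come out unbalanced, which is one of the unexpected issues a real HTML parser would avoid.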
clear-host
[System.Uri]$url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions'
$rqst = Invoke-WebRequest $url
$rqst.ParsedHtml.getElementsByTagName('table') | ConvertFrom-HtmlTable
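The question asks for a CSV file, so the last step is to pipe the resulting objects to Export-Csv. The sketch below uses hand-built stand-in rows shaped like the question's example table so that it runs without a web request; the $rows data and the $csvPath location are assumptions for illustration.

```powershell
# Stand-in rows shaped like the question's example table (not scraped data).
$rows = @(
    [pscustomobject]@{ Datacenter = 'NewYork'; FirstDNS = '1.1.1.1'; SecondDNS = '2.2.2.2'; ThirdDNS = '3.3.3.3'; FourthDNS = '4.4.4.4' }
    [pscustomobject]@{ Datacenter = 'India'; FirstDNS = '1.2.3.4'; SecondDNS = '3.2.6.5'; ThirdDNS = '8.2.3.7'; FourthDNS = '8.3.66.1' }
)

# Hypothetical output path in the temp folder.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'table.csv'

# -NoTypeInformation suppresses the "#TYPE ..." line that Windows PowerShell writes by default.
$rows | Export-Csv -Path $csvPath -NoTypeInformation -Encoding UTF8

# Read the file back to confirm the column structure survived.
$roundTrip = @(Import-Csv -Path $csvPath)
$roundTrip | Format-Table
```

With the functions above, the equivalent for a real page would be to pass a single table, for example @($rqst.ParsedHtml.getElementsByTagName('table'))[0] | ConvertFrom-HtmlTable | Export-Csv -Path $csvPath -NoTypeInformation. Export-Csv takes its columns from the first object it receives, so piping several differently shaped tables into one file would lose columns. Note also that ParsedHtml is only populated in Windows PowerShell; it is not available in PowerShell 6 and later.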
FYI: I've also published an earlier version of this code on CodeReview, so check there to see if anyone suggests any good improvements.