How to extract specific tables from an HTML file using native PowerShell commands?

Question

I use the PAL tool (https://pal.codeplex.com/) to generate HTML reports from perfmon logs on Windows. After PAL processes the .blg files from perfmon, it dumps the information into an HTML document containing tables with various data points about how the system performed. I am currently writing a script that looks through a directory for all HTML files and runs Get-Content on each of them.

What I would like to do is scrape the Get-Content output for specific tables that have a varying number of rows. Is it possible, using native PowerShell cmdlets, to look for specific tables, count how many rows are in each table, and dump just the desired tables and table rows?

Here is an example of the table format I'm trying to scrape:

<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>\LogicalDisk(*)\Disk Transfers/sec</B></TH><TH><B>Min</B></TH><TH><B>Avg</B></TH><TH><B>Max</B></TH><TH><B>Hourly Trend</B></TH><TH><B>Std Deviation</B></TH><TH><B>10% of Outliers Removed</B></TH><TH><B>20% of Outliers Removed</B></TH><TH><B>30% of Outliers Removed</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/C:</TD><TD>1</TD><TD>7</TD><TD>310</TD><TD>0</TD><TD>11</TD><TD>5</TD><TD>5</TD><TD>5</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/D:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/E:</TD><TD>0</TD><TD>24</TD><TD>164</TD><TD>-1</TD><TD>11</TD><TD>22</TD><TD>21</TD><TD>20</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/HarddiskVolume5</TD><TD>0</TD><TD>0</TD><TD>2</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/L:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/T:</TD><TD>0</TD><TD>7</TD><TD>430</TD><TD>0</TD><TD>21</TD><TD>3</TD><TD>2</TD><TD>2</TD></TR>
</TABLE>

The table ID is constant across all the output files, but the number of table rows is not. Any help is appreciated!

Recommended answer

OK, this isn't thoroughly tested, but it works with your example table in PS 2.0 with IE11:

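# Parsing HTML with IE.
$oIE = New-Object -ComObject InternetExplorer.Application
# Resolve the file name to a full path; a relative name may be treated as a web address.
$oIE.Navigate((Resolve-Path "file.html").Path)
# Navigate is asynchronous: wait until loading completes (ReadyState 4 = complete).
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$oHtmlDoc = $oIE.Document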

# Getting table by ID.
$oTable = $oHtmlDoc.getElementByID("table6")

# Extracting table rows as a collection.
$oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
$cTrs = $oTbody.childNodes | Where-Object { $_.tagName -eq "tr" }

# Creating a collection of table headers.
$cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
$cHeaders = @()
foreach ($oTh in $cThs) {
    $cHeaders += `
        ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
}

# Converting rows to a collection of PS objects exportable to CSV.
$cCsv = @()
foreach ($oTr in $cTrs) {
    $cTds = $oTr.childNodes | Where-Object { $_.tagName -eq "td" }
    # Skipping the first row (headers).
    if (-not $cTds) { continue }
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $cHeaders.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] `
            -Value $cTds[$i].innerHTML
    }
    $cCsv += $oRow
}

# Closing IE.
$oIE.Quit()

# Exporting CSV.
$cCsv | Export-Csv -Path "file.csv" -NoTypeInformation

Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.
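
For environments where the InternetExplorer.Application COM object is not available, a rough alternative is to pull the table out with regular expressions. This is not the approach used in the answer above, only a minimal sketch under stated assumptions: the table can be found by its ID attribute, there are no nested tables, and every cell is well-formed. The shortened sample HTML and all variable names ($html, $rows, $records) are hypothetical, and regexes are brittle on real-world HTML.

```powershell
# Hypothetical, shortened sample input; for a real report this could be
# (Get-Content "file.html") -join "`n".
$html = @'
<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>Min</B></TH><TH><B>Max</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>1</TD><TD>310</TD></TR>
<TR><TD>No Thresholds</TD><TD>0</TD><TD>164</TD></TR>
</TABLE>
'@

# Case-insensitive matching, and let "." span line breaks.
$opts = [Text.RegularExpressions.RegexOptions]'IgnoreCase, Singleline'

# Isolate the body of the table whose ID is "table6" (assumes no nested tables).
$tableMatch = [regex]::Match($html, '<table[^>]*\bid="table6"[^>]*>(.*?)</table>', $opts)
if (-not $tableMatch.Success) { throw 'Table "table6" not found.' }

# Split the table body into rows, and each row into cell texts with inner tags removed.
$rows = @(
    foreach ($tr in [regex]::Matches($tableMatch.Groups[1].Value, '<tr[^>]*>(.*?)</tr>', $opts)) {
        $cells = @(
            foreach ($cell in [regex]::Matches($tr.Groups[1].Value, '<t[hd][^>]*>(.*?)</t[hd]>', $opts)) {
                ($cell.Groups[1].Value -replace '<[^>]+>', '').Trim()
            }
        )
        # The unary comma keeps each row as a single array element.
        , $cells
    }
)

# The first row holds the headers; every later row becomes one PS object.
$headers = $rows[0]
$records = @(
    for ($r = 1; $r -lt $rows.Count; $r++) {
        $record = New-Object PSObject
        for ($i = 0; $i -lt $headers.Count; $i++) {
            $record | Add-Member -MemberType NoteProperty -Name $headers[$i] -Value $rows[$r][$i]
        }
        $record
    }
)

# The number of data rows answers the "count how many rows" part of the question.
"Data rows: $($records.Count)"
```

The resulting $records collection can be piped to Export-Csv in the same way as $cCsv in the answer's script.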
