Retrieve text in HTML with PowerShell
Question
In this HTML code:
<div id="ajaxWarningRegion" class="infoFont"></div>
<span id="ajaxStatusRegion"></span>
<form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
<pre>
Creating a new ZIP of IP Phone files from HTTP/PhoneBackup
and HTTPS/PhoneBackup
</pre>
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<pre></pre>
<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
Download the new ZIP of IP Phone files
</a>
</div>
I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.
How can I do that?
Solution
What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more programmatically palatable due to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:
Select-NodeContent $doc.DocumentNode "//a/@href"
And this one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:
- Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
- Install PowerShell Community Extensions if you want to parse a live web page.
- Understand XPath to be able to construct a navigable path to your target node.
- Understand regular expressions to be able to extract a substring from your target node.
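The XPath and regular-expression pieces can be tried in isolation before installing anything. The following is only a sketch: it uses PowerShell's built-in [xml] type on a hand-made, well-formed fragment (a hypothetical stand-in, since the real page from the question is not valid XML and would be rejected), and the variable names are illustrative.

```powershell
# Well-formed stand-in for the question's markup (hypothetical fragment;
# the real page is not valid XML, so [xml] would refuse to parse it).
[xml]$fragment = '<div><a href="/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip">Download</a></div>'

# XPath: select the href attribute of the first <a> element in the document.
$href = $fragment.SelectSingleNode('//a/@href').Value

# Regex: the parenthesized group captures everything between the
# "IP_PHONE_BACKUP-" prefix and the ".zip" suffix.
$stamp = $null
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $stamp = $matches[1]
}

$stamp   # 2012-Jul-25_15:47:47
```

The same XPath string and the same pattern are what the one-liners above pass to Select-NodeContent; only the parser differs.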
With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from a file or from the web, depending on your needs.
Set-StrictMode -Version Latest

# Assumes HtmlAgilityPack.dll has been copied to a "bin" folder next to your PowerShell profile.
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    # Note: ?: below is the PSCX alias for Invoke-Ternary (condition, if-true, if-false).
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
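Since the question asks for the date and time, the captured substring can be turned into a real DateTime if that is useful. This is a sketch rather than part of the answer above: it assumes the timestamp always has the shape seen in the question (2012-Jul-25_15:47:47) with an English month abbreviation, and the variable names are illustrative.

```powershell
# Value as returned by the second one-liner (hard-coded here so the sketch runs on its own).
$stamp = '2012-Jul-25_15:47:47'

# Use the invariant culture so "Jul" is parsed the same way on any machine.
$culture = [System.Globalization.CultureInfo]::InvariantCulture

# The format string mirrors the timestamp: year, abbreviated month name, day,
# an underscore, then 24-hour time.
$when = [datetime]::ParseExact($stamp, 'yyyy-MMM-dd_HH:mm:ss', $culture)

$when.ToString('yyyy-MM-dd HH:mm:ss', $culture)   # 2012-07-25 15:47:47
```

ParseExact throws if the text does not match the format exactly, so a page that reports a failure instead of a file name would surface as an error rather than a wrong date.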