Retrieve text in HTML with PowerShell

Views: 1098 | Published: 2018/6/15 13:07:24 | Tags: html, regex, powershell


Problem description

In this HTML code:

    <div id="ajaxWarningRegion" class="infoFont"></div>
    <span id="ajaxStatusRegion"></span>
    <form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
      <pre> Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup </pre>
      <pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
      <pre>Reports Success</pre>
      <pre></pre>
      <a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip> Download the new ZIP of IP Phone files </a>
    </div>

I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

How can I do that?

Solution

What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more programmatically palatable due to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:

    Select-NodeContent $doc.DocumentNode "//a/@href"

And this one extracts the desired substring:

    Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"

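
To see what the regular expression in that second one-liner captures, independent of any HTML parsing, here is a minimal sketch that applies the same pattern to the href value from the question. The variable names and the final conversion to a DateTime are illustrative assumptions, not part of the original answer.

```powershell
# The href value from the question's sample HTML.
$href = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

# Same pattern as in the one-liner; group 1 is the text between the prefix and ".zip".
$pattern = 'IP_PHONE_BACKUP-(.*)\.zip'

$timestamp = $null
if ($href -match $pattern) {
    # -match fills the automatic $Matches table; index 1 is the first capture group.
    $timestamp = $Matches[1]
}
$timestamp   # 2012-Jul-25_15:47:47

# Optional (assumption): convert the captured text into a DateTime value.
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$parsed = [datetime]::ParseExact($timestamp, 'yyyy-MMM-dd_HH:mm:ss', $culture)
```

The capture group is greedy, which is harmless here because the string contains only one ".zip".
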
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:

Install HtmlAgilityPack to make HTML parsing look just like XML parsing.

Install PowerShell Community Extensions if you want to parse a live web page.

Understand XPath to be able to construct a navigable path to your target node.

Understand regular expressions to be able to extract a substring from your target node.
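
As a rough illustration of the XPath requirement, the sketch below runs the same `//a/@href` path with PowerShell's built-in [xml] type. It assumes a hand-cleaned, well-formed fragment (quoted attribute, balanced tags) that does not appear in the question. The anchor as written in the question, with its unquoted href, is rejected by the XML parser, which is the reason HtmlAgilityPack is needed for the real page.

```powershell
# Hypothetical well-formed version of part of the question's markup.
$wellFormed = '<div><pre>Reports Success</pre><a href="/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip">Download the new ZIP of IP Phone files</a></div>'
$xml = [xml]$wellFormed

# //a/@href selects the href attribute of the first <a> element found in the document.
$href = $xml.SelectSingleNode('//a/@href').Value

# The anchor as written in the question (unquoted attribute value) is not valid XML,
# so the cast throws and the catch block records the failure.
$raw = '<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download</a>'
$rawIsXml = $true
try { $null = [xml]$raw } catch { $rawIsXml = $false }
```
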

With those requirements satisfied you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how you assign a value to the $doc variable used in the above one-liners. I show how to load HTML from a file or from the web, depending on your needs.

    Set-StrictMode -Version Latest
    $HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
    Add-Type -Path $HtmlAgilityPackPath

    function Select-NodeContent(
        [HtmlAgilityPack.HtmlNode]$node,
        [string] $xpath,
        [string] $regex,
        [Object] $default = "")
    {
        # Note: ?: is assumed to be the Invoke-Ternary alias supplied by PSCX
        # (condition, value if true, value if false); it is not built in.
        if ($xpath -match "(.*)/@(\w+)$") {
            # If standard XPath to retrieve an attribute is given,
            # map to supported operations to retrieve the attribute's text.
            ($xpath, $attribute) = $matches[1], $matches[2]
            $resultNode = $node.SelectSingleNode($xpath)
            $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
        }
        else {
            # Retrieve an element's text
            $resultNode = $node.SelectSingleNode($xpath)
            $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
        }
        # If a regex is given, use it to extract a substring from the text
        if ($regex) {
            if ($text -match $regex) { $text = $matches[1] }
            else { $text = $default }
        }
        return $text
    }

    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $result = $doc.Load("tmp\temp.html") # Use this to load a file
    #$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
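
The function above relies on `?:`, which is not a built-in Windows PowerShell operator. If that command is unavailable in your session, the same null-safe attribute lookup can be sketched with plain if statements. The helper name Get-AttributeOrDefault is hypothetical, and the demo uses a built-in [xml] element as a stand-in for an HtmlAgilityPack node, on the assumption that both expose an Attributes collection that returns $null for a missing name.

```powershell
function Get-AttributeOrDefault {
    param(
        $Node,                 # node returned by SelectSingleNode, possibly $null
        [string] $Attribute,   # attribute name, e.g. 'href'
        $Default = ''          # value to return when the node or attribute is missing
    )
    if ($null -eq $Node) { return $Default }
    $attr = $Node.Attributes[$Attribute]
    if ($null -eq $attr) { return $Default }
    return $attr.Value
}

# Stand-in node built with the [xml] type (assumption: used only for this demo).
$anchor = ([xml]'<a href="/tmp/example.zip">Download</a>').DocumentElement

$found   = Get-AttributeOrDefault $anchor 'href'          # attribute present
$missing = Get-AttributeOrDefault $anchor 'title' 'n/a'   # attribute absent
$noNode  = Get-AttributeOrDefault $null 'href' 'n/a'      # no node matched
```

Unlike the ternary form, this version also falls back to the default when the node exists but lacks the requested attribute.
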
