Retrieve text in HTML with PowerShell

Question

    In this HTML code:

    <div id="ajaxWarningRegion" class="infoFont"></div>
      <span id="ajaxStatusRegion"></span>
      <form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
        <pre>
          Creating a new ZIP of IP Phone files from HTTP/PhoneBackup 
          and HTTPS/PhoneBackup
        </pre>
        <pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
        <pre>Reports Success</pre>
        <pre></pre>
        <a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
          Download the new ZIP of IP Phone files
        </a>
      </div>
    

    I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

    How can I do that?

    Solution

    What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more programmatically palatable thanks to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:

    Select-NodeContent $doc.DocumentNode "//a/@href"
    

    And this one extracts the desired substring:

    Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
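
The regular expression step can be tried on its own, without HtmlAgilityPack. The following is a minimal sketch that assumes the href value has already been retrieved as a string. The variable names and the conversion to a DateTime are illustrative additions, not part of the original answer.

```powershell
# Sample href value copied from the HTML in the question.
$href = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

# -match fills the automatic $matches variable; index 1 is the first capture group.
$stamp = $null
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $stamp = $matches[1]
}

# Optional extra step (an assumption about what you might want next): turn the
# captured text into a DateTime using a format string that mirrors the file name.
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$when = [datetime]::ParseExact($stamp, 'yyyy-MMM-dd_HH:mm:ss', $culture)
```

Here $stamp holds the same substring that the second one-liner extracts, and $when is only needed if you want to compare or reformat the timestamp.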
    

    The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:

    • Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
    • Install PowerShell Community Extensions if you want to parse a live web page.
    • Understand XPath to be able to construct a navigable path to your target node.
    • Understand regular expressions to be able to extract a substring from your target node.
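
To get a feel for the XPath part before installing anything, the built-in [xml] type can be used on a well-formed stand-in for the sample. This is only a sketch: the fragment below is a cleaned-up assumption (quoted attribute, balanced tags), not the HTML from the question, and it also shows the kind of input that makes a strict XML parser give up.

```powershell
# A well-formed stand-in for the sample: attribute value quoted, tags balanced.
$wellFormed = '<div><pre>Reports Success</pre><a href="/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip">Download</a></div>'
$xml = [xml]$wellFormed

# //a/@href selects the href attribute of an <a> element anywhere in the document.
$href = $xml.SelectSingleNode('//a/@href').Value

# //pre selects the first <pre> element; InnerText is its text content.
$status = $xml.SelectSingleNode('//pre').InnerText

# The unquoted attribute value, as written in the question, is rejected as XML.
$malformed = '<div><a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download</a></div>'
$parsedAsXml = $true
try { $null = [xml]$malformed } catch { $parsedAsXml = $false }
```

The same XPath expressions carry over to HtmlAgilityPack, which is what lets the real, messier HTML be queried in the same style.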

    With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from a file or from the web, depending on your needs.

    Set-StrictMode -Version Latest
    $HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
    Add-Type -Path $HtmlAgilityPackPath
    
    function Select-NodeContent(
        [HtmlAgilityPack.HtmlNode]$node,
        [string] $xpath,
        [string] $regex,
        [Object] $default = "")
    {
        if ($xpath -match "(.*)/@(\w+)$") {
            # If standard XPath to retrieve an attribute is given,
            # map to supported operations to retrieve the attribute's text.
            # Note: ?: is the PSCX alias for Invoke-Ternary, so PSCX must be loaded.
            ($xpath, $attribute) = $matches[1], $matches[2]
            $resultNode = $node.SelectSingleNode($xpath)
            $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
        }
        else { # retrieve an element's text
            $resultNode = $node.SelectSingleNode($xpath)
            $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
        }
        # If a regex is given, use it to extract a substring from the text
        if ($regex) {
            if ($text -match $regex) { $text = $matches[1] }
            else { $text = $default }
        }
        return $text
    }
    
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $result = $doc.Load("tmp\temp.html") # Use this to load a file
    #$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this  PSCX cmdlet to load a live web page
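
The first branch of Select-NodeContent relies on splitting an XPath such as //a/@href into an element path and an attribute name. That step can be looked at in isolation. The sketch below uses a hypothetical helper, Split-AttributeXPath, which is not part of the original answer and exists only to illustrate the regular expression.

```powershell
# Hypothetical helper (not from the original answer): split an XPath that ends
# in /@attribute into the element path and the attribute name.
function Split-AttributeXPath([string] $xpath) {
    if ($xpath -match '(.*)/@(\w+)$') {
        # Group 1 is everything before /@, group 2 is the attribute name.
        [pscustomobject]@{ ElementPath = $matches[1]; Attribute = $matches[2] }
    }
    else {
        # No trailing /@attribute: the whole expression addresses an element.
        [pscustomobject]@{ ElementPath = $xpath; Attribute = $null }
    }
}

$split = Split-AttributeXPath '//a/@href'
$plain = Split-AttributeXPath '//pre'
```

With the path split this way, the function can select the element first and then read the named attribute from it, which is what the first branch does.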
    
