Parse through related tags in HTML

Question

I need to extract item-name, item-manufacturer, and item-actual from the outerHTML below in PowerShell.

<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
    <DIV class=text-small>2 ml</DIV>
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 6</DIV>
    <DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">
    <DIV class=row>
    <DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
    <DIV class=text-small>28 Tablets</DIV>
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 5.72</DIV>
    <DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">

The rendered output looks like this:

Spasmonil (20mg) - Cipla Limited - Rs. 6
Sprintas (75mg) - Intas Laboratories Pvt - Rs. 5.72

I am doing this in a quite inefficient way: I get 4 outputs (drugsname, drugsquan, drugspric, drugsmanu) in separate txt files and combine them manually afterwards. Can someone help me do this in a more elegant way?

$regex1 = 'item-name.*?>(.*?)</A>'
$regex2 = 'text-small>(.*?)</DIV>'
$regex3 ='"item-manufacturer visible-xs">(.*?)</DIV>'
$regex4 ='item-actual>(.*?)</DIV>'

$drugsname = $ie.Document.body.outerHTML -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex1){
      $matches[1]      
    }
  }

$drugsquan = $ie.Document.body.outerHTML  -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex2){
      $matches[1]      
    }
  }

$drugsmanu = $ie.Document.body.outerHTML  -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex3){
      $matches[1]      
    }
  }

$drugspric = $ie.Document.body.outerHTML  -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex4){
      $matches[1]      
    }
  }

$drugsname > "d:\users\desktop\HKD\($control)drugsname.txt"
$drugsquan > "d:\users\desktop\HKD\($control)drugsquan.txt"
$drugsmanu > "d:\users\desktop\HKD\($control)drugsmanu.txt"
$drugspric > "d:\users\desktop\HKD\($control)drugspric.txt"

Recommended answer

Using a multi-line/single-line regex in a here-string (aka "jumbo shrimp in a can"):

$data = 
@'
<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
    <DIV class=text-small>2 ml</DIV>
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 6</DIV>
    <DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">
    <DIV class=row>
    <DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
    <DIV class=text-small>28 Tablets</DIV>
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 5.72</DIV>
    <DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">
'@

[regex]$regex = 
@'
(?ms).*?<DIV class=row>.*?
.+?item-name href=".+?>(.+?)</A>.*?
.+?text-small>(.+?)</DIV>.*?
.+?item-manufacturer.+?>(.+?)</DIV></DIV>.*?
.+?item-actual>(.+?)</DIV>
'@

# Turn each regex match into one object with a property per capture group.
$regex.Matches($data) |
    ForEach-Object {
        [PSCustomObject]@{
            Name         = $_.Groups[1].Value
            Quantity     = $_.Groups[2].Value
            Manufacturer = $_.Groups[3].Value
            Price        = $_.Groups[4].Value
        }
    }

Name                       Quantity                   Manufacturer               Price                    
----                       --------                   ------------               -----                    
Spasmonil (20mg)           2 ml                       Cipla Limited              Rs. 6                    
Sprintas (75mg)            28 Tablets                 Intas Laboratories Pvt Ltd Rs. 5.72                 
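
The `(?ms)` prefix in the pattern above turns on two inline options. A small sketch of what `(?s)` contributes, using a made-up two-line string rather than the real page: without it, `.` does not match a newline, so a capture cannot span lines. The `(?m)` option only changes how `^` and `$` behave, and since the pattern above uses neither anchor, it appears to have no effect there.

```powershell
# Made-up sample: the link text is split across two lines (`n is a newline).
$text = "<A class=item-name>Spasmonil`n(20mg)</A>"

# Without (?s), '.' stops at the newline, so this match fails.
$withoutS = $text -match 'item-name>(.+?)</A>'

# With (?s), '.' also matches the newline, so the capture spans both lines.
$withS    = $text -match '(?s)item-name>(.+?)</A>'
$captured = $Matches[1]
```

This is why the answer can describe one whole item, spread over several lines of HTML, with a single pattern instead of matching line by line.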

Now you have an object collection that you can sort, filter, format, and export to suit your needs.
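
As a rough illustration of those follow-up steps, here is a minimal sketch. The `$drugs` array is a hand-written stand-in for the objects the answer's code produces, and the variable names and the CSV file name in the comment are assumptions, not part of the original answer.

```powershell
# Hypothetical stand-in for the objects produced by the regex in the answer.
$drugs = @(
    [PSCustomObject]@{ Name = 'Spasmonil (20mg)'; Quantity = '2 ml'; Manufacturer = 'Cipla Limited'; Price = 'Rs. 6' }
    [PSCustomObject]@{ Name = 'Sprintas (75mg)'; Quantity = '28 Tablets'; Manufacturer = 'Intas Laboratories Pvt Ltd'; Price = 'Rs. 5.72' }
)

# Format: build the "Name - Manufacturer - Price" lines asked for in the question.
$lines = $drugs | ForEach-Object { '{0} - {1} - {2}' -f $_.Name, $_.Manufacturer, $_.Price }

# Filter: keep only one manufacturer's items.
$cipla = @($drugs | Where-Object { $_.Manufacturer -eq 'Cipla Limited' })

# Sort: strip the "Rs. " prefix so the price is compared as a number.
$byPrice = $drugs | Sort-Object { [double]($_.Price -replace '^Rs\.\s*', '') }

# Export: CSV text in memory; to write a file instead, something like
#   $drugs | Export-Csv -Path 'drugs.csv' -NoTypeInformation   # hypothetical path
$csv = $drugs | ConvertTo-Csv -NoTypeInformation
```

Because every item is a single object, the four separate text files and the manual merge from the question are no longer needed.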
