Regex to extract the contents of a <div> tag

Question

Having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't approved of for this, but it's a simple web scraping application where there are no nested divs.

I'm trying to match this:

    <div class="entry">
      <span class="title">Some company</span>
      <span class="description">
        <strong>Address: </strong>Some address
        <br /><strong>Telephone: </strong> 01908 12345
      </span>
    </div>

The simple VB code is as follows:

    Dim myMatches As MatchCollection
    Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
    Dim wc As New WebClient
    Dim html As String = wc.DownloadString("http://somewebaddress.com")
    RichTextBox1.Text = html
    myMatches = myRegex.Matches(html)
    MsgBox(html)
    'Search for all the words in a string
    Dim successfulMatch As Match
    For Each successfulMatch In myMatches
        MsgBox(successfulMatch.Groups(1).ToString)
    Next

Any help would be much appreciated.

Recommended answer

Your regex works for your example. There are some improvements that should be made, though:

<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.

.*? (note the ?) means "match any number of characters, but as few as possible". This prevents a single match from running from the first <div class="entry"> tag to the last </div> on your page.
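
To make the greedy/lazy difference concrete, here is a minimal sketch. The sample string, the module name and the Contents helper are made up for illustration; the lazy pattern is the one given above, and the greedy pattern is the same thing with the ? removed.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LazyVersusGreedy
    ' Hypothetical page fragment with two entry divs (not from the original post).
    Public Const SampleHtml As String =
        "<div class=""entry"">first</div><p>between</p><div class=""entry"">second</div>"

    ' The pattern above with the ? removed, so .* is greedy.
    Public Const GreedyPattern As String =
        "<div[^<>]*class=""entry""[^<>]*>(?<content>.*)</div>"

    ' The pattern above as given: .*? is lazy.
    Public Const LazyPattern As String =
        "<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>"

    ' Collects the "content" group of every match of pattern in html.
    Public Function Contents(pattern As String, html As String) As List(Of String)
        Dim result As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            result.Add(m.Groups("content").Value)
        Next
        Return result
    End Function

    Sub Main()
        Console.WriteLine("Greedy: " & String.Join(" | ", Contents(GreedyPattern, SampleHtml)))
        Console.WriteLine("Lazy:   " & String.Join(" | ", Contents(LazyPattern, SampleHtml)))
    End Sub
End Module
```

On this sample the greedy pattern should produce a single match whose content runs across both divs, while the lazy pattern should produce two matches, "first" and "second".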

But your regex itself should still have matched something. Perhaps you're not using it correctly?
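
One likely culprit, offered as a guess rather than a diagnosis: the question's loop reads successfulMatch.Groups(1), but the question's pattern contains no parentheses, so there is no capture group 1. In .NET, asking for a group that does not exist gives back an unsuccessful group whose value is an empty string, which would explain blank message boxes even though the match itself succeeded. The sketch below uses a flattened, made-up version of the sample markup; the module and helper names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    ' Shortened, single-line stand-in for the question's markup (an assumption).
    Public Const SampleHtml As String =
        "<div class=""entry""><span class=""title"">Some company</span></div>"

    ' The question's pattern: no parentheses, hence no capture group 1.
    Public Const NoGroupPattern As String = "<div.*?class=""entry"".*?>.*</div>"

    ' The same pattern with parentheses around the part we want back.
    Public Const GroupPattern As String = "<div.*?class=""entry"".*?>(.*)</div>"

    ' Returns the text of capture group 1 for the first match of pattern.
    Public Function FirstGroup(pattern As String, html As String) As String
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.Singleline)
        Return m.Groups(1).Value
    End Function

    Sub Main()
        Dim whole As Match = Regex.Match(SampleHtml, NoGroupPattern, RegexOptions.Singleline)
        ' Groups(0), or simply .Value, is always the whole match.
        Console.WriteLine("Whole match: " & whole.Value)
        Console.WriteLine("Groups(1) without parentheses: '" & FirstGroup(NoGroupPattern, SampleHtml) & "'")
        Console.WriteLine("Groups(1) with parentheses:    '" & FirstGroup(GroupPattern, SampleHtml) & "'")
    End Sub
End Module
```

Either reading Groups(0) or adding a capture group (numbered or named) should get the text out.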

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

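' SubjectString holds the downloaded HTML; ResultList collects each div's contents.
Dim ResultList As New List(Of String)
' Singleline lets "." match newlines, so the match can span a multi-line div.
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While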

I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:

<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>

or (behold the joy of multiline strings in VB.NET):

Dim RegexObj As New Regex(
    "<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
    "<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
    "(?<title>.*?)" & chr(10) & _
    "\s*</span>\s*" & chr(10) & _
    "<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
    "<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
    "(?<address>.*?)" & chr(10) & _
    "\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
    "(?<phone>.*?)" & chr(10) & _
    "\s*</span>\s*</div>", 
    RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results for MatchResult.Groups("title") etc...)
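
As a rough sketch of what storing those results might look like: the Entry class, the EntryParser module and the ParseEntries function below are hypothetical names, not part of the original answer, and the sample string simply reproduces the markup from the question. The pattern is the multi-line one shown above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

' Hypothetical container for one parsed entry.
Public Class Entry
    Public Property Title As String
    Public Property Address As String
    Public Property Phone As String
End Class

Module EntryParser
    ' The multi-line pattern from above; the line breaks are ignored
    ' thanks to RegexOptions.IgnorePatternWhitespace.
    Private ReadOnly EntryRegex As New Regex(
        String.Join(vbLf, {
            "<div[^<>]*class=""entry""[^<>]*>\s*",
            "<span[^<>]*class=""title""[^<>]*>\s*",
            "(?<title>.*?)",
            "\s*</span>\s*",
            "<span[^<>]*class=""description""[^<>]*>\s*",
            "<strong>\s*Address:\s*</strong>\s*",
            "(?<address>.*?)",
            "\s*<strong>\s*Telephone:\s*</strong>\s*",
            "(?<phone>.*?)",
            "\s*</span>\s*</div>"
        }),
        RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

    ' The markup from the question, used here as test input.
    Public ReadOnly SampleHtml As String = String.Join(vbLf, {
        "<div class=""entry"">",
        "  <span class=""title"">Some company</span>",
        "  <span class=""description"">",
        "  <strong>Address: </strong>Some address",
        "    <br /><strong>Telephone: </strong> 01908 12345",
        "  </span>",
        "</div>"
    })

    ' Walks over every match and copies the named groups into Entry objects.
    Public Function ParseEntries(html As String) As List(Of Entry)
        Dim entries As New List(Of Entry)
        Dim m As Match = EntryRegex.Match(html)
        While m.Success
            entries.Add(New Entry With {
                .Title = m.Groups("title").Value,
                .Address = m.Groups("address").Value,
                .Phone = m.Groups("phone").Value
            })
            m = m.NextMatch()
        End While
        Return entries
    End Function

    Sub Main()
        For Each e As Entry In ParseEntries(SampleHtml)
            Console.WriteLine("Title:   " & e.Title)
            Console.WriteLine("Address: " & e.Address)
            Console.WriteLine("Phone:   " & e.Phone)
        Next
    End Sub
End Module
```

Note that on the question's markup the address group also picks up the trailing <br />, because the pattern does not account for it. That is a small illustration of how fragile this approach is.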
