Use RegEx to extract text between HTML tags


Problem description


    I need to extract some text from a string in Visual Basic, like this:

    <div id="div">
    <h2 id="id-date">09.09.2010</h2> (extract the date here)
    <h3 id="nr">000</h3> (extract the number here)
    </div>
    

    I need to extract both the date and the number from within the div. This will also run in a loop, meaning more div blocks need to be parsed. Thank you! Adrian

    Solution

    Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect then using a regex is possible.
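For comparison, here is a minimal sketch of the HTML Agility Pack route. It assumes the HtmlAgilityPack NuGet package is referenced in the project; the module name and the XPath expressions are my own choices for the fragment in the question, not something the question specified.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    Sub Main()
        Dim html As String = "<div id=""div"">" & Environment.NewLine & _
                             "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
                             "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
                             "</div>"

        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath.
        Dim divs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div")
        If divs Is Nothing Then
            Console.WriteLine("No div found!")
            Return
        End If

        For Each divNode As HtmlNode In divs
            ' The leading "." keeps each search inside the current div.
            Dim dateNode As HtmlNode = divNode.SelectSingleNode(".//h2[@id='id-date']")
            Dim numberNode As HtmlNode = divNode.SelectSingleNode(".//h3[@id='nr']")

            If dateNode IsNot Nothing AndAlso numberNode IsNot Nothing Then
                Console.WriteLine("Date: " & dateNode.InnerText.Trim())
                Console.WriteLine("Number: " & numberNode.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Because the parser works on the document tree, this version does not care about line breaks, attribute order, or extra attributes on the tags.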

    If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless because my answer makes a few assumptions. Some questions to consider:

    • Will the HTML be on one line or multiple lines, separated by newline characters?
    • Will the HTML always be in the form of <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be h1-h6 tags?
    • On top of the hN tags, will the date and number always be between the tags with id-date and nr values for the id attribute?
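To illustrate how the answers affect the regex, here is a sketch of a looser pattern. It assumes the heading level can be anything from h1 to h6 and that other attributes may appear before or after the id attribute. The sample input and the module name are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RelaxedPatternSketch
    Sub Main()
        ' Hypothetical input: different heading levels and an extra class attribute.
        Dim input As String = "<div id=""div"">" & Environment.NewLine & _
                              "<h1 class=""title"" id=""id-date"">09.09.2010</h1>" & Environment.NewLine & _
                              "<h4 id=""nr"">000</h4>" & Environment.NewLine & _
                              "</div>"

        ' h[1-6] accepts any heading level; [^>]* skips other attributes inside the tag.
        Dim pattern As String = "<h[1-6][^>]*\sid=""id-date""[^>]*>(?<Date>\d{2}\.\d{2}\.\d{4})</h[1-6]>.*?" & _
                                "<h[1-6][^>]*\sid=""nr""[^>]*>(?<Number>\d+)</h[1-6]>"

        Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)
        If m.Success Then
            Console.WriteLine("Date: " & m.Groups("Date").Value)     ' Date: 09.09.2010
            Console.WriteLine("Number: " & m.Groups("Number").Value) ' Number: 000
        Else
            Console.WriteLine("No match!")
        End If
    End Sub
End Module
```

The looser the pattern, the more likely it is to match something you did not intend, so only relax the parts your input actually requires.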

    Depending on the answers to these questions the pattern can change. The following code assumes each HTML fragment follows the structure you shared, that it will have an h2 and an h3 holding the date and the number, respectively, and that each tag will be on a new line. If you feed it different input it will likely fail until you adjust the pattern to match your input's structure.

    ' Requires "Imports System.Text.RegularExpressions" at the top of the file.
    ' Sample input: one div block with each tag on its own line.
    Dim input As String = "<div id=""div"">" & Environment.NewLine & _
                   "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
                   "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
                   "</div>"
    
    ' The named groups "Date" and "Number" capture the h2 and h3 contents.
    Dim pattern As String = "<div[^>]+>.*?" & _
                     "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                     "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"
    
    Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)
    
    If m.Success Then
        ' DateTime.Parse interprets the text using the current culture's date format.
        Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
        Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
        Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
        Console.WriteLine("Actual Date: " & actualDate)
        Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
        Console.WriteLine("Actual Number: " & actualNumber)
    Else
        Console.WriteLine("No match!")
    End If
    

    The pattern can be on one line but I broke it up for clarity. RegexOptions.Singleline makes the . metacharacter match \n as well, so the .*? parts of the pattern can span line breaks.
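A small sketch of what the option changes, using a cut-down pattern and input of my own rather than the full one from the answer:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineSketch
    Sub Main()
        ' vbLf is used explicitly so the result does not depend on the platform's newline.
        Dim input As String = "<h2>09.09.2010</h2>" & vbLf & "<h3>000</h3>"
        Dim pattern As String = "</h2>.*?<h3>"

        ' Without Singleline, "." does not match "\n", so the pattern cannot cross the line break.
        Console.WriteLine(Regex.IsMatch(input, pattern))                          ' False
        ' With Singleline, "." matches every character, including "\n".
        Console.WriteLine(Regex.IsMatch(input, pattern, RegexOptions.Singleline)) ' True
    End Sub
End Module
```

Note that Singleline is unrelated to RegexOptions.Multiline, which only changes how ^ and $ behave.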

    You also said:

    "This will also run in a loop, meaning more div blocks need to be parsed."

    Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter you'll want to use Regex.Matches and treat each Match result similarly to the above piece of code.


    EDIT: here is some sample code to demonstrate parsing multiple occurrences.

    ' Requires "Imports System.Text.RegularExpressions" at the top of the file.
    ' Sample input: two div blocks in a single string.
    Dim input As String = "<div id=""div"">" & Environment.NewLine & _
                   "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
                   "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
                   "</div>" & _
                   "<div id=""div"">" & Environment.NewLine & _
                   "<h2 id=""id-date"">09.14.2010</h2>" & Environment.NewLine & _
                   "<h3 id=""nr"">123</h3>" & Environment.NewLine & _
                   "</div>"
    
    Dim pattern As String = "<div[^>]+>.*?" & _
                     "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                     "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"
    
    ' Regex.Matches returns every non-overlapping match, one per div block.
    For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
        ' DateTime.Parse interprets the text using the current culture's date format.
        Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
        Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
        Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
        Console.WriteLine("Actual Date: " & actualDate)
        Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
        Console.WriteLine("Actual Number: " & actualNumber)
    Next
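One caveat with the loop above: DateTime.Parse depends on the culture of the machine running the code, so a value like 09.14.2010 can throw under a culture that expects the day first. Here is a sketch of a culture-independent variant. It assumes the dates are written month first (MM.dd.yyyy), which is only a guess based on the sample value 09.14.2010; the module and constant names are mine.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module CultureSafeParsingSketch
    Sub Main()
        Dim input As String = "<div id=""div"">" & Environment.NewLine & _
                              "<h2 id=""id-date"">09.14.2010</h2>" & Environment.NewLine & _
                              "<h3 id=""nr"">123</h3>" & Environment.NewLine & _
                              "</div>"

        Dim pattern As String = "<div[^>]+>.*?" & _
                                "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                                "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

        ' Assumption: month comes first. Use "dd.MM.yyyy" if the source writes the day first.
        Const ExpectedFormat As String = "MM.dd.yyyy"

        For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
            Dim actualDate As DateTime
            Dim actualNumber As Integer

            ' The Try* methods return False instead of throwing on bad input.
            Dim dateOk As Boolean = DateTime.TryParseExact(m.Groups("Date").Value, ExpectedFormat, _
                                                           CultureInfo.InvariantCulture, DateTimeStyles.None, actualDate)
            Dim numberOk As Boolean = Integer.TryParse(m.Groups("Number").Value, NumberStyles.None, _
                                                       CultureInfo.InvariantCulture, actualNumber)

            If dateOk AndAlso numberOk Then
                Console.WriteLine("Date: " & actualDate.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)) ' Date: 2010-09-14
                Console.WriteLine("Number: " & actualNumber)                                                  ' Number: 123
            Else
                Console.WriteLine("Could not convert: " & m.Value)
            End If
        Next
    End Sub
End Module
```

The regex only checks the shape of the date (two digits, two digits, four digits), so a value such as 99.99.2010 still matches; TryParseExact is what rejects it.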
    
