Regex on HTML


Question

Hello everyone,

I have a small problem using a regular expression to get the text from an HTML <textarea>.
This is the HTML source that contains the information I would like to get.

<div id="description-parent" class="msg">
  <textarea id="description" class="text meninges" cols="43" rows="8" name="description" type="text" required="required">The Information starts here and continues through to the end of the end.


  Bla Bla Bla
  
Bla Bla Bla

  
 As you can see this informtion is not stored in any formatting.
 
 it is all just plaining text.

 
      Bla Bla Bla

  Bla Bla Bla

      
Lots and lots of information and this is the end.</textarea>
</div>










I can use regex to get values on one line but not the whole paragraph.
The text I need to get is all between:

name="description" type="text" required="required">






And

</textarea>










This is the current VB.NET code that I am playing with to try to get the text information from the HTML source.

Dim regex As New System.Text.RegularExpressions.Regex("<div id=""description-parent"" class=""msg"">.*") ' I cannot figure out what to place here
Dim matches As MatchCollection = regex.Matches(My.Computer.FileSystem.ReadAllText("D:\temp\source.html").ToString) ' This is the html source

For Each items In matches
    Try
        MessageBox.Show(items.ToString) ' Once i can place the information into a variable then i can work with it
    Catch ex As Exception
        MessageBox.Show("Error: " & ex.Message)
    End Try
Next






Any help or advice is much appreciated. I am sure I am just overlooking one thing.

Recommended answer


Trying to extract data from HTML with regular expressions is very common among beginners and, in most cases, is a methodological mistake. First of all, the most common case is that the HTML is well-formed XML. In that case, the .NET XML parsers should be used, and they are always available. This is my short review of them:



  1. Use the class System.Xml.XmlDocument. It implements the DOM interface; this way is the easiest, and good enough if the size of the document is not too big.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx.
  2. Use the class System.Xml.XmlTextReader; this is the fastest way of reading, especially if you need to skip some data.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx.
  3. Use the class System.Xml.Linq.XDocument; this is the most adequate way, similar to that of XmlDocument, and it supports LINQ to XML programming.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx.
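As a rough illustration of option 3, here is a minimal sketch that reads the textarea text with XDocument. It assumes the markup is well-formed XML (the fragment in the question is, but a complete page may not be). The module name and the shortened sample string are made up for the example and are not part of the original question.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module TextareaXmlSketch
    Sub Main()
        ' Shortened stand-in for the question's markup; assumed to be well-formed XML.
        Dim html As String =
            "<div id=""description-parent"" class=""msg"">" & vbLf &
            "  <textarea id=""description"" name=""description"" required=""required"">First line" & vbLf &
            vbLf &
            "  Bla Bla Bla" & vbLf &
            "Last line</textarea>" & vbLf &
            "</div>"

        Dim doc As XDocument = XDocument.Parse(html)

        ' Find the first <textarea> element whose id attribute is "description".
        Dim area As XElement = doc.Descendants("textarea").FirstOrDefault(
            Function(el) el.Attribute("id") IsNot Nothing AndAlso
                         el.Attribute("id").Value = "description")

        If area IsNot Nothing Then
            ' Value holds the element's text content, including its line breaks.
            Console.WriteLine(area.Value)
        Else
            Console.WriteLine("No matching <textarea> found.")
        End If
    End Sub
End Module
```

For a file on disk, XDocument.Load("D:\temp\source.html") could take the place of XDocument.Parse under the same well-formedness assumption. If the markup is not well-formed, either call throws an XmlException, which is the signal to switch to an HTML parser instead.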






In rarer cases, well-formed XML cannot be assumed. Even though such cases, so to speak, simply have no right to exist, in real life it happens. Then you still need to use an HTML parser that can deal with such cases. I would advise trying this one: http://www.majestic12.co.uk/projects/html_parser.php.

You can try to find some more: http://bit.ly/15ZhBKr.

Good luck,

—SA


Please don't do that.

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

