Regex a html
Problem description
Hello everyone,
I have a small problem using a regular expression to get the text from an HTML <textarea>.
This is the HTML source that contains the information I would like to get.
<div id="description-parent" class="msg">
<textarea id="description" class="text meninges" cols="43" rows="8" name="description" type="text" required="required">The Information starts here and continues through to the end of the end.
Bla Bla Bla
Bla Bla Bla
As you can see this informtion is not stored in any formatting.
it is all just plaining text.
Bla Bla Bla
Bla Bla Bla
Lots and lots of information and this is the end.</textarea>
</div>
I can use regex to get values on one line but not the whole paragraph.
The text I need to get is all between:
name="description" type="text" required="required">
And
</textarea>
This is the current VB.NET code that I am playing with to try to get the text information from the HTML source.
Dim regex As New System.Text.RegularExpressions.Regex("<div id=""description-parent"" class=""msg"">.*") ' I cannot figure out what to place here
Dim matches As MatchCollection = regex.Matches(My.Computer.FileSystem.ReadAllText("D:\temp\source.html").ToString) ' This is the html source
For Each items In matches
    Try
        MessageBox.Show(items.ToString) ' Once I can place the information into a variable, I can work with it
    Catch ex As Exception
        MessageBox.Show("Error: " & ex.Message)
    End Try
Next
Any help or advice is much appreciated, I am sure I am just overlooking one thing.
Recommended answer
Trying to apply Regular Expressions to extract data from HTML is very common among beginners and, in most cases, is a methodological mistake. First of all, the most common case is when the HTML is well-formed XML. In this case, the .NET XML parsers should be used, and they are always available. This is my short review of them:
- Use the System.Xml.XmlDocument class. It implements the DOM interface; this way is the easiest and is good enough if the size of the document is not too big. See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx.
- Use the class System.Xml.XmlTextReader; this is the fastest way of reading, especially if you need to skip some data. See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx.
- Use the class System.Xml.Linq.XDocument; this is the most adequate way, similar to that of XmlDocument, supporting LINQ to XML programming. See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx.
In rarer cases, well-formed XML cannot be assumed. Even though such cases, so to speak, simply have no right to exist, in real life it happens. Then you still need to use an HTML parser which can deal with such cases. I would advise trying this one: http://www.majestic12.co.uk/projects/html_parser.php.
You can try to find some more: http://bit.ly/15ZhBKr.
Good luck,
Please don't do that.
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html