Regular Expression to Extract HTML Body Content
Problem description
I am looking for a regex that will let me extract the HTML content between the body tags of an XHTML document.
The XHTML that I need to parse will be very simple files; I do not have to worry about JavaScript content or <![CDATA[ sections, for example.
Below is the expected structure of the HTML file that I have to parse. Since I know exactly what content the HTML files I am going to work with will contain, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<p>
&nbsp;
</p>
<p>
<br />
&nbsp;
</p>
<h1>Header 1</h1>
</body>
</html>
Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
Solution
Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
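One detail worth checking before relying on this pattern: the sample document spans several lines, and "." does not match a newline by default. The sketch below illustrates this in Python rather than C# (an assumption made only so the example is easy to run); Python's re.DOTALL plays the role of RegexOptions.Singleline in .NET. The groups are made non-capturing here because Python's re.split would otherwise also return the captured text, and the shortened SAMPLE string is illustrative.

```python
import re

# Shortened version of the sample document from the question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<h1>Header 1</h1>
</body>
</html>
"""

# First pattern from the answer, with non-capturing groups so that
# re.split returns only the text between matches.
PATTERN = r"(?:(?:.(?!<body[^>]*>))+.<body[^>]*>)|(?:</body\>.+)"

# Without DOTALL, "." stops at newlines, so nothing matches this input.
no_flag = re.search(PATTERN, SAMPLE)

# With DOTALL, everything outside the body is matched and split away.
parts = re.split(PATTERN, SAMPLE, flags=re.DOTALL)
body = parts[1].strip()
print(body)
```

If the split returns a single element, the pattern did not match at all, which is the symptom to expect when the dot-matches-newline option is missing.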
Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
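As a rough check that the simplified pattern behaves like the look-ahead version, here is a minimal sketch, again in Python as a stand-in for C# Regex.Split(). The helper name extract_body and the two test strings are hypothetical, the groups are made non-capturing so that re.split returns only the text between matches, and re.DOTALL is assumed so that "." can cross line breaks.

```python
import re

# Patterns from the answer, with groups made non-capturing for re.split.
LOOKAHEAD = r"(?:(?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(?:<\s*/\s*body\s*\>.+)"
SIMPLE = r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*\>.+)"


def extract_body(document: str, pattern: str) -> str:
    """Split away everything outside the body element and return the rest."""
    parts = re.split(pattern, document, flags=re.DOTALL)
    # A document with one body element splits into ['', body, ''].
    if len(parts) != 3:
        raise ValueError("document does not have exactly one body element")
    return parts[1].strip()


# Hypothetical inputs: one tidy, one with spaces inside the tags.
tidy = "<html>\n<head></head>\n<body>\n<p>Hi</p>\n</body>\n</html>\n"
spaced = "<html>\n<head></head>\n< body class='x' >\n<p>Hi</p>\n< / body >\n</html>\n"

for doc in (tidy, spaced):
    print(extract_body(doc, LOOKAHEAD), extract_body(doc, SIMPLE))
```

Both patterns end in ".+", so they expect at least one character after the closing body tag; a document that stops exactly at </body> would not be split at that point.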