Regular Expression to Extract HTML Body Content


Problem description

I am looking for a regex that will let me extract the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly what content the HTML files I will be working with contain, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <p>
      &nbsp;
    </p>
    <p>
      <br />
      &nbsp;
    </p>
    <h1>Header 1</h1>
  </body>
</html>

Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:

((.|\n)*<body (.)*>)|((</body>(*|\n)*)

...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
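One likely reason the attempt above fails outright is that it is not a valid pattern: `(*|\n)` starts with a quantifier that has nothing to repeat (probably `(.|\n)` was intended), and the second alternative opens three parentheses but closes only two. The sketch below checks this with Python's `re` module rather than RegexBuddy or .NET, so the exact error text will differ between engines; `ATTEMPT` and `compile_error` are hypothetical names used only for illustration.

```python
import re
from typing import Optional

# The pattern from the question, copied verbatim.
ATTEMPT = r"((.|\n)*<body (.)*>)|((</body>(*|\n)*)"


def compile_error(pattern: str) -> Optional[str]:
    """Return the engine's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Prints the reason Python's engine rejects the pattern.
    print(compile_error(ATTEMPT))
```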

Solution

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed xhtml document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
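As a rough check of that last pattern, here is a minimal sketch in Python rather than C#, under a few assumptions that are not part of the original answer. The names `OUTSIDE_BODY`, `SAMPLE` and `extract_body` are hypothetical. The groups are rewritten as non-capturing, because Python's `re.split` (like .NET's `Regex.Split`) also returns the text of capturing groups. The `re.DOTALL` flag is added so that `.` matches newlines; in .NET the corresponding option is `RegexOptions.Singleline`. Without it, `.+` cannot match past the line break that follows `</body>` in the sample.

```python
import re

# The answer's final pattern with non-capturing groups (an assumption made
# here so that split() returns only the text between the matches).
OUTSIDE_BODY = re.compile(
    r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*\>.+)",
    re.DOTALL,  # let "." match newlines as well
)

# Shortened version of the sample document from the question.
SAMPLE = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <h1>Header 1</h1>
  </body>
</html>
'''


def extract_body(html: str) -> str:
    """Split away everything outside <body>...</body> and return the rest."""
    pieces = OUTSIDE_BODY.split(html)
    # The pieces before the first match and after the last one are empty
    # strings here, so joining leaves only the inner body content.
    return "".join(pieces).strip()


if __name__ == "__main__":
    print(extract_body(SAMPLE))
```

Note that the second alternative ends in `.+`, so it needs at least one character after the closing body tag. That holds for a well-formed document, where `</html>` follows, but not for a fragment that ends exactly at `</body>`.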
