Regular Expression to Extract HTML Body Content

Views: 221 Published: 2018/6/13 16:57:23 Tags: c# html regex xhtml


Problem description

I am looking for a regex statement that will let me extract the HTML content from just between the body tags of an XHTML document.

The XHTML that I need to parse will be very simple files; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly all of the content of the HTML files that I am going to have to work with, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <p>
      &nbsp;
    </p>
    <p>
      <br />
      &nbsp;
    </p>
    <h1>Header 1</h1>
  </body>
</html>
Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
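One possible reason, assuming the pattern above is reproduced exactly as it was tested: it contains six opening parentheses but only five closing ones, and `(*` applies a quantifier to nothing, so the .NET regex parser would be expected to reject it before any matching happens. The sketch below is only a quick way to check that; `PatternCheck` and `IsValidPattern` are hypothetical names, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Hypothetical helper: true if the .NET regex parser accepts the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            // Constructing a Regex parses the pattern immediately.
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // .NET reports malformed patterns (unbalanced parentheses,
            // a quantifier with nothing to repeat) through this exception type.
            return false;
        }
    }
}
```

Calling `PatternCheck.IsValidPattern(@"((.|\n)*<body (.)*>)|((</body>(*|\n)*)")` should return false for the pattern as written here.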
Solution
Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s in order to take into account <  body ...> (element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
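As a minimal sketch of how the last pattern could be combined with Regex.Split() as described in the question: the pattern string is taken unchanged from the answer, while the two options are assumptions added here. RegexOptions.Singleline lets `.` match newlines, which the pattern needs on a multi-line document. RegexOptions.ExplicitCapture stops the unnamed groups from capturing, because Regex.Split() otherwise inserts captured text into its result array. The names `BodyExtractor`, `ExtractBody`, `OutsideBody` and `Splitter` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; not part of the original answer.
public static class BodyExtractor
{
    // Last pattern from the answer, unchanged: matches everything up to and
    // including <body ...>, or everything from </body> onwards.
    private const string OutsideBody = @"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)";

    // Singleline: '.' also matches '\n', so a match can span several lines.
    // ExplicitCapture: unnamed (...) groups do not capture, so Regex.Split
    // does not add the matched head/tail text to its result.
    private static readonly Regex Splitter =
        new Regex(OutsideBody, RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    // Returns the text between <body ...> and </body>, or null when the
    // input does not have the expected head / body / tail shape.
    public static string ExtractBody(string xhtml)
    {
        string[] parts = Splitter.Split(xhtml);
        // Expected pieces: "" before the head match, the body, "" after the tail match.
        return parts.Length == 3 ? parts[1] : null;
    }
}
```

The returned string keeps the whitespace that surrounds the body markup, so a caller may want to apply `Trim()` to it. Because the pattern ends in `.+`, at least one character (such as `</html>`) must follow `</body>` for the tail to match.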


                        