Regular Expression - Remove HTML comment spanning multiple line breaks


Problem description

I am using this script:

http://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text

to convert some Outlook HTML to plain text.

It nearly works; the only thing it leaves behind is the CSS that Outlook places in HTML comment tags <!-- --> in addition to the <style> tags (which are removed).

This is the original HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
span.EmailStyle17
    {mso-style-type:personal-compose;
    font-family:"Calibri","sans-serif";
    color:windowtext;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
@page WordSection1
    {size:612.0pt 792.0pt;
    margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">tesst<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:dimgray;mso-fareast-language:EN-GB">JOE BLOGS</span></b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:dimgray;mso-fareast-language:EN-GB">
</div>
</body>
</html>



This is the resulting text (note that the HTML comment has not been removed):

<!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
span.EmailStyle17
    {mso-style-type:personal-compose;
    font-family:"Calibri","sans-serif";
    color:windowtext;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
@page WordSection1
    {size:612.0pt 792.0pt;
    margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
    {page:WordSection1;}
-->

tesst
&nbsp;
JOE BLOGS



I have tried adapting the StripHTML() function with the following additional replaces, but these did not work either:

result = System.Text.RegularExpressions.Regex.Replace(result, "(<!--).*?(-->)", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<!--*-->", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)

Please help. This was a 2-minute job that I've been stuck on since lunch. *facedesk*

Cheers

Edit 1: I also tried the following, still no joy:

result = System.Text.RegularExpressions.Regex.Replace(result, "<!--.*-->", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<!--.*?-->", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)

Edit 2: I noticed this question was getting a lot of views. Anyone reading this should definitely think twice about taking the regex approach. Instead, I recommend using Lynx (an open-source text-based browser) to convert HTML to plain text. I asked a similar question here, and its edits include sample code, based on the answers, that should get you started using lynx.exe from within a .NET application. This is the method we ended up using, and we haven't had any problems since.

Recommended answer

Your second regular expression fails for three reasons:


  • You need to use . to match any character.
  • The * is greedy. You want *? to match lazily.
  • You need RegexOptions.Singleline so that . also matches line breaks.

Try this:

result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline);
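
The three points above can be checked with a small console program. This is only a sketch: the class name CommentStripDemo, the helper StripComments, and the shortened sample markup are made up for illustration and are not part of the original script.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripDemo
{
    // Hypothetical helper: wraps the Regex.Replace call suggested above.
    public static string StripComments(string html)
    {
        return Regex.Replace(html, "<!--.*?-->", "", RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Shortened stand-in for the Outlook markup in the question.
        string html = "<style><!--\np.MsoNormal\n    {margin:0cm;}\n--></style>\n"
                    + "<p>tesst</p><!-- one line --><p>JOE BLOGS</p>";

        // Without Singleline, '.' does not match '\n', so the multi-line
        // comment survives and only the one-line comment is removed.
        string withoutOption = Regex.Replace(html, "<!--.*?-->", "");
        Console.WriteLine(withoutOption.Contains("p.MsoNormal")); // True

        // Greedy .* with Singleline runs from the first <!-- to the last -->,
        // so the paragraph between the two comments is deleted as well.
        string greedy = Regex.Replace(html, "<!--.*-->", "", RegexOptions.Singleline);
        Console.WriteLine(greedy.Contains("tesst")); // False

        // Lazy .*? with Singleline removes each comment separately.
        Console.WriteLine(StripComments(html));
    }
}
```

With the lazy pattern and Singleline, both comments are removed while the text between them is kept. The empty <style></style> element still has to be handled by the script's existing tag-stripping step.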



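I strongly recommend that you don't use regular expressions to parse HTML. You will save yourself a whole world of pain if you use the HTML Agility Pack instead.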
