How do I filter all HTML tags except a certain whitelist?

Views: 30 | Published: 2021/12/6 10:09:32 | Tags: c#, html, vb.net, regex

Question

This is for .NET. IgnoreCase is set and MultiLine is NOT set.

Usually I'm decent at regex; maybe I'm running low on caffeine...

Users are allowed to enter HTML-encoded entities (&lt;, &amp;, etc.), and to use the following HTML tags:

u, i, b, h3, h4, br, a, img

Self-closing <br/> and <img/> are allowed, with or without the extra space, but are not required.

I would like to:

1. Strip all starting and ending HTML tags other than those listed above.
2. Remove attributes from the remaining tags, except anchors can have an href.

My search pattern (replaced with an empty string) so far:

<(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>

This seems to be stripping all but the start and end tags I want, but there are three problems:

1. Having to include the end tag version of each allowed tag is ugly.
2. The attributes survive. Can this happen in a single replacement?
3. Tags starting with the allowed tag names slip through. E.g., "<abbrev>" and "<iframe>".

The following suggested pattern does not strip out tags that have no attributes.

    </?(?!i|b|h3|h4|a|img)[^>]*>

As mentioned below, ">" is legal in an attribute value, but it's safe to say I won't support that. Also, there will be no CDATA blocks, etc. to worry about. Just a little HTML.

Loophole's answer is the best one so far, thanks! Here's his pattern (hoping the PRE works better for me):

    static string SanitizeHtml(string html)
    {
        string acceptable = "script|link|title";
        string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        return Regex.Replace(html, stringPattern, "sausage");
    }

Some small tweaks I think could still be made to this answer:

1. I think this could be modified to capture simple HTML comments (those that do not themselves contain tags) by adding "!--" to the "acceptable" variable and making a small change to the end of the expression to allow for an optional trailing "\s--".
2. I think this would break if there are multiple whitespace characters between attributes (example: heavily formatted HTML with line breaks and tabs between attributes).

Edit 2009-07-23: Here's the final solution I went with (in VB.NET):

    Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
    Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
        ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
    html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled)

The caveat is that the HREF attribute of A tags still gets scrubbed, which is not ideal.

Recommended answer

Here's a function I wrote for this task:

    // Requires: using System.Text.RegularExpressions;
    static string SanitizeHtml(string html)
    {
        // Tag names to leave alone, separated by vertical bars.
        string acceptable = "script|link|title";
        // Conditional: if an acceptable name follows, try the impossible "notag"
        // (so the tag is skipped); otherwise match the whole tag with its attributes.
        string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
        return Regex.Replace(html, stringPattern, "sausage");
    }

For some reason I posted a correction to my previous answer as a separate answer, so I am consolidating them here.

I will explain the regex a bit, because it is a little long.

The first part matches an open bracket and 0 or 1 slashes (in case it's a close tag).

Next you see an if-then construct with a lookahead: (?(?=SomeTag)then|else). I am checking to see if the next part of the string is one of the acceptable tags. You can see that I concatenate the regex string with the acceptable variable, which holds the acceptable tag names separated by a vertical bar so that any of the terms will match. If it is a match, you can see I put in the word "notag", because no tag would match that, and if the tag is acceptable I want to leave it alone. Otherwise I move on to the else part, where I match any tag name: [a-zA-Z0-9]+

Next, I want to match 0 or more attributes, which I assume are in the form attribute="value". So now I group the part representing an attribute, but I use ?: to prevent this group from being captured, for speed: (?:\s[a-zA-Z0-9-]+=?(?:(["",']?).*?\1?)?)*

Here I begin with the whitespace character that would be between the tag and attribute names, then match an attribute name: [a-zA-Z0-9-]+

Next I match an equals sign, and then either kind of quote. I group the quote so it will be captured, and I can use the backreference \1 later to match the same type of quote. In between these two quotes, you can see I use the period to match anything; however, I use the lazy version *? instead of the greedy version * so that it will only match up to the next quote that would end this value.

Next we put a * after closing the groups with parentheses so that it will match multiple attribute/value combinations (or none). Last we match some whitespace with \s*, and 0 or 1 ending slashes in the tag for XML-style self-closing tags.

You can see I'm replacing the tags with sausage, because I'm hungry, but you could replace them with an empty string too to just clear them out.
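Two of the question's complaints are that tags such as "<iframe>" slip through because they start with an allowed name, and that attributes cannot be trimmed while keeping href on anchors. One possible way to handle both, sketched below, is to match every tag with a single simple pattern and make the whitelist decision in a MatchEvaluator callback rather than inside the regex. This is not the code from the answer above: the class name HtmlWhitelist, the method Sanitize, and the sample input are assumptions made for illustration, and the whitelist is the one listed in the question. Like the question, the sketch assumes ">" never appears inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HtmlWhitelist
{
    // Whitelist taken from the question (lower-case names).
    static readonly HashSet<string> Allowed =
        new HashSet<string> { "u", "i", "b", "h3", "h4", "br", "a", "img" };

    // Any start or end tag. The name runs up to the first whitespace, "/" or ">",
    // so "<iframe>" is read as "iframe", not as "i" followed by something else.
    static readonly Regex Tag = new Regex(
        @"<(?<close>/?)(?<name>[a-zA-Z][^\s/>]*)(?<rest>[^>]*)>");

    // A quoted href attribute that is preceded by whitespace.
    static readonly Regex Href = new Regex(
        @"(?<=\s)href\s*=\s*(?<value>""[^""]*""|'[^']*')",
        RegexOptions.IgnoreCase);

    public static string Sanitize(string html)
    {
        return Tag.Replace(html, m =>
        {
            string name = m.Groups["name"].Value.ToLowerInvariant();
            if (!Allowed.Contains(name))
                return string.Empty;              // not whitelisted: drop the tag

            if (m.Groups["close"].Value == "/")
                return "</" + name + ">";         // end tag, rebuilt without extras

            string rest = m.Groups["rest"].Value;
            if (name == "a")
            {
                // Anchors keep a quoted href and lose every other attribute.
                Match href = Href.Match(rest);
                return href.Success
                    ? "<a href=" + href.Groups["value"].Value + ">"
                    : "<a>";
            }

            // Other allowed tags lose all attributes; keep the self-closing form if used.
            bool selfClosing = rest.TrimEnd().EndsWith("/", StringComparison.Ordinal);
            return selfClosing ? "<" + name + " />" : "<" + name + ">";
        });
    }

    static void Main()
    {
        string input =
            "<div class=\"x\"><A HREF=\"/home\" onclick=\"evil()\">Home</A><iframe></iframe><br/></div>";
        Console.WriteLine(Sanitize(input));
        // prints: <a href="/home">Home</a><br />
    }
}
```

Because the callback rebuilds each allowed tag from its name alone, the end-tag variants do not need to be listed separately and the whole job happens in one Replace call. The sketch does not inspect the href value itself, so a scheme check (for example rejecting "javascript:" URLs) would still have to be added before using something like this on untrusted input.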