Regular expression to remove HTML tags
Problem description
I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove <a href="blah">blah</a>, it leaves the </a>.
I do not know regular expression syntax at all and fumbled my way through this. Can someone with regex knowledge please provide a pattern that will work?
Here is my code:
string sPattern = @"<\/?!?(img|a)[^>]*>";
Regex rgx = new Regex(sPattern);
Match m = rgx.Match(sSummary);
string sResult = "";
if (m.Success)
    sResult = rgx.Replace(sSummary, "", 1);
I am looking to remove the first occurrence of the <a> and <img> tags.
Recommended answer
Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be parsed 100% correctly with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
Here is a link to a blog post I wrote a while back that goes into more detail about this problem:
- http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
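As a rough illustration of the parser-based approach, the sketch below uses System.Xml.Linq from the .NET base class library. It assumes the input is a well-formed XML fragment (every tag closed, attribute values quoted), which real-world HTML often is not; arbitrary HTML would need a dedicated HTML parser library instead. The names TagUnwrapper and UnwrapFirst are hypothetical and exist only for this example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TagUnwrapper
{
    // Removes the first element whose name is in `names`, keeping its child
    // nodes (for example the link text) in place.
    // Assumption: `fragment` is well-formed XML; XElement.Parse throws otherwise.
    public static string UnwrapFirst(string fragment, params string[] names)
    {
        // Wrap in a synthetic root so the fragment may have several top-level nodes.
        var root = XElement.Parse("<root>" + fragment + "</root>",
                                  LoadOptions.PreserveWhitespace);

        var target = root.Descendants().FirstOrDefault(
            e => names.Contains(e.Name.LocalName, StringComparer.OrdinalIgnoreCase));

        if (target != null)
        {
            // Replace the element with its own children; an empty element
            // such as <img/> is simply removed.
            target.ReplaceWith(target.Nodes());
        }

        // Serialize everything under the synthetic root back to a string.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

With this sketch, TagUnwrapper.UnwrapFirst("<a href=\"blah\">blah</a>", "a", "img") returns "blah", because the parser treats the opening and closing tags as one element rather than as two unrelated pieces of text.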
That being said, here is a solution that should fix this particular problem. It is by no means a perfect solution, though.
// Capture the text between the first <img> or <a> tag and the next tag.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) {
    sResult = m.Groups["content"].Value;
}
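For comparison, here is a minimal sketch of a different regex-only variant, not the answer's method. The question's code leaves the closing tag because Replace is called with a count of 1, so only the first match (the opening tag) is removed. The sketch below drops that limit and strips every <a> and <img> tag, opening or closing, while keeping the text between them. It also adds \b so that "a" does not match tags such as <abbr>. The class and method names are hypothetical, and the same caveats about parsing HTML with a regex apply (for instance, a ">" inside an attribute value would break it).

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorImgStripper
{
    // </?        optional slash, so closing tags match as well
    // (?:img|a)  the two tag names of interest
    // \b         word boundary, so <abbr> or <address> are left alone
    // [^>]*>     everything up to the end of the tag
    private static readonly Regex TagPattern =
        new Regex(@"</?(?:img|a)\b[^>]*>", RegexOptions.IgnoreCase);

    // Removes all <a> and <img> tags and keeps the surrounding text.
    public static string StripAll(string html)
    {
        return TagPattern.Replace(html, "");
    }
}
```

Calling AnchorImgStripper.StripAll("<a href=\"blah\">blah</a>") returns "blah". Limiting the removal to the first link only is harder to do reliably with a regex, which is where a parser becomes the safer choice.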