What's the best way to remove HTML from a string?


Question


I recently started using the following RegEx in a ReReplace() function to strip HTML tags from a string using ColdFusion. Please note: I am not using this as protection from XSS or SQL injection; this is only to remove existing and safe HTML from a string before it's displayed in an HTML title attribute.

REReplaceNoCase(str,"<[^>]*>","","ALL")

In a semi-related question I asked how to modify my RegEx to include spaces and line breaks. I was told that using RegEx for this purpose is not appropriate and this post was referenced as an explanation.

I strongly suspect though that the regular expressions you have posted don't in fact work correctly. I'd advise you not to use regular expressions to parse HTML as HTML is not a regular language. Use an HTML parser instead. (Mark Byers)

If this is true, what is the appropriate tool for removing HTML from a string before it's displayed? (Bearing in mind that the HTML is already safe; it's sanitized before entry to the DB.)

I am aware of HTMLEditFormat() and HTMLCodeFormat(), but those two functions do not provide what I need; the former replaces special characters with their HTML-escaped equivalents, while the latter does exactly the same but also wraps the string in a <pre> tag.

What I would like to do is strip HTML and line breaks from a string before I display it in an HTML title attribute, for example <a title="My string without HTML goes here">...</a>

There are times when the HTML is not necessary. Say you wanted to display an excerpt from a post without the HTML stored along with it, for instance.

Solution

I disagree with the reasoning you quote. While HTML should not be parsed with regexen, stripping tags is perfect for them.

But you will want to be more careful than just <[^>]*>, since that would turn

<span title=">">...</span>

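into the ill-formed

">...

(the match for the opening tag ends at the > inside the attribute value, so the stray "> is left behind; with the ALL scope the closing </span> is then stripped as well).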

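So you need something like <([^'">]|"[^"]*"|'[^']*')*> instead. Note that the first alternative has to exclude both quote characters; if it only excludes the double quote, a single-quoted value containing > still ends the match early. You can strip out line breaks with plain character replacement instead of a regex, but if you prefer a regex you can use something like [\r\n]+ (or even combine it with the above using alternation, but that's even less efficient).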
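
To make the difference between the two patterns concrete, here is a minimal sketch. It uses Python's re module rather than ColdFusion purely so the behaviour is easy to check; the same pattern text should carry over to REReplaceNoCase, with the double quotes doubled inside a CFML string. The names NAIVE_TAG, QUOTE_AWARE_TAG, strip_tags and to_title_text are hypothetical and exist only for this illustration.

```python
import re

# Naive pattern from the question: the match ends at the first '>',
# even when that '>' sits inside a quoted attribute value.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware pattern: inside a tag, consume either an ordinary character
# (anything except quotes and '>'), a whole double-quoted value,
# or a whole single-quoted value.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^'">]|"[^"]*"|'[^']*')*>""")

# Runs of carriage returns and line feeds (covers \n, \r and \r\n).
LINE_BREAKS = re.compile(r"[\r\n]+")


def strip_tags(text):
    """Remove tags, treating '>' inside quoted attribute values as data."""
    return QUOTE_AWARE_TAG.sub("", text)


def to_title_text(text):
    """Strip tags, then collapse each run of line breaks into one space."""
    return LINE_BREAKS.sub(" ", strip_tags(text)).strip()


if __name__ == "__main__":
    sample = '<span title=">">Hello</span>'
    print(NAIVE_TAG.sub("", sample))  # the stray "> survives
    print(strip_tags(sample))         # only the text content remains
    print(to_title_text("<p>line one</p>\r\n<p>line two</p>"))
```

This only removes markup; it assumes, as the question states, that the input is already sanitized, and it does not escape quote characters that may remain in the text before it is placed in a title attribute.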
