What's the best way to remove HTML from a string?
Problem description

I recently started using the following RegEx in a REReplace() function to strip HTML tags from a string using ColdFusion. Please note: I am not using this as protection from XSS or SQL injection; this is only to remove existing and safe HTML from a string before it's displayed in an HTML title attribute.

REReplaceNoCase(str,"<[^>]*>","","ALL")

In a semi-related question I asked how to modify my RegEx to include spaces and line breaks. I was told that using RegEx for this purpose is not appropriate, and this post was referenced as an explanation:

"I strongly suspect though that the regular expressions you have posted don't in fact work correctly. I'd advise you not to use regular expressions to parse HTML as HTML is not a regular language. Use an HTML parser instead." (Mark Byers)

If this is true, what is the appropriate tool for removing HTML from a string before it's displayed? (Bearing in mind the HTML is already safe; it's sanitized before entry to the DB.)

I am aware of HTMLEditFormat() and HTMLCodeFormat(), but those two functions do not provide what I need: the former replaces special characters with their HTML-escaped equivalents, while the latter does exactly the same but also wraps the string in a <pre> tag.

What I would like to do is strip HTML and line breaks from a string before I display it in an HTML title attribute:

<a title="My string without HTML goes here">...</a>

There are times when the HTML is not necessary. Say you wanted to display an excerpt from a post without the HTML stored along with it, for instance.

I disagree with the reasoning you quote. While HTML should not be parsed with regexen, stripping tags is perfect for them.

But you will want to be more careful than just

<[^>]*>

since that would turn

<span title=">">...</span>

into the ill-formed

">...</span>

So you need something like

<([^">]|"[^"]*"|'[^']*')*>

instead. You can strip out line breaks with character replacement instead of a regex, but if you prefer a regex you can use something like \n (or even combine it with the above using alternation, but that's even less efficient).
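To make the difference between the two patterns concrete, here is a minimal sketch in Python rather than ColdFusion, chosen only because the behavior is easy to check there. The pattern syntax used is common to both regex dialects as far as I know, but treat that as an assumption and test in your own engine. The names NAIVE_TAG, ANSWER_TAG, QUOTE_AWARE_TAG and strip_html are hypothetical. One adjustment here is not part of the original answer: the answer's first alternative [^">] can also consume a lone single quote, so a > inside a single-quoted attribute still ends the tag early. The QUOTE_AWARE_TAG variant excludes both quote characters to avoid that.

```python
import re

# Pattern from the question: "<", any run of non-">" characters, then ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

# Pattern from the answer, unchanged: quoted attribute values are consumed
# as a unit, so a ">" inside double quotes does not end the tag.
ANSWER_TAG = re.compile(r"""<([^">]|"[^"]*"|'[^']*')*>""")

# Adjusted variant (an assumption of this sketch, not from the answer):
# the first alternative also excludes "'", so single-quoted values are
# always handled by the third alternative.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^"'>]|"[^"]*"|'[^']*')*>""")

# One or more carriage returns / line feeds in a row.
LINE_BREAKS = re.compile(r"[\r\n]+")


def strip_html(text: str) -> str:
    """Remove tags, then replace each run of line breaks with one space."""
    without_tags = QUOTE_AWARE_TAG.sub("", text)
    return LINE_BREAKS.sub(" ", without_tags).strip()


if __name__ == "__main__":
    double_quoted = '<span title=">">text</span>'
    single_quoted = "<span title='>'>text</span>"

    print(NAIVE_TAG.sub("", double_quoted))        # ">text
    print(ANSWER_TAG.sub("", double_quoted))       # text
    print(ANSWER_TAG.sub("", single_quoted))       # '>text
    print(QUOTE_AWARE_TAG.sub("", single_quoted))  # text
    print(strip_html('<a href="x">Hello</a>,\r\n<b>world</b>'))  # Hello, world
```

Tags and line breaks are stripped in two passes here rather than with one combined alternation, in line with the answer's remark that combining them is less efficient. As in the question, this assumes the input has already been sanitized; it is not a defense against XSS.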