Stripping HTML tags without using HtmlAgilityPack

Question

I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:

  • It's not known ahead of time whether a document contains HTML at all.
  • More likely than not, any HTML will be very poorly formatted.
  • Individual documents might be very large, perhaps hundreds of megabytes.
  • Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of <.+/?> are a no go. (And stripping XML is less desirable, anyway.)
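To illustrate the last point, here is a minimal sketch of what the naive pattern does to text that merely contains angle brackets. Python's `re` module and the function name `naive_strip` are assumptions made for the example; the question does not name a language or regex engine.

```python
import re

# The naive pattern quoted in the question. `.+` is greedy, so a match runs
# from the first "<" to the last ">" on the line.
NAIVE_RE = re.compile(r"<.+/?>")

def naive_strip(text: str) -> str:
    """Apply the naive pattern, deleting everything it matches."""
    return NAIVE_RE.sub("", text)

if __name__ == "__main__":
    sample = "if a < b then <b>bold</b> c > d"
    # Everything between the first "<" and the last ">" disappears,
    # including ordinary text that was never part of a tag.
    print(repr(naive_strip(sample)))  # 'if a  d'
```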

I'm currently using HTML Agility Pack, and it's just not cutting the mustard. Performance is poorer than I'd like, it doesn't always handle truly awful formatting as gracefully as it could, and lately I've been running into problems with stack overflows on some of the more upsettingly large files.

I suspect that all of these problems stem from the fact that it's trying to actually parse the data, which makes it a poor fit for my needs. I don't want a syntax tree; I just want (most of) the tags to go away.

Using regular expressions seems like the obvious candidate. But then I remember this famous answer, and it makes me worry that it's not such a great idea. That diatribe's points are very focused on parsing, though, and not necessarily on dumb tag-stripping. So are regexes OK for this purpose?

Assuming it isn't a terrible idea, suggestions for regex that would do a good job are very welcome.

Recommended answer

This regex matches tags while ignoring angle brackets that appear inside quoted strings within a tag.

<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>

It cannot detect escaped quotes inside quotes (but I think that is unnecessary in HTML).
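As a rough illustration, the pattern above could be applied like this. This is a sketch under assumptions: Python's `re` module and the function name `strip_tags` are chosen only for the example, and the pattern itself is copied unchanged from the answer.

```python
import re

# The answer's pattern, copied verbatim. Python's `re` is used purely for
# illustration; the original post does not name a regex engine.
TAG_RE = re.compile(r"""<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>""")

def strip_tags(text: str) -> str:
    """Remove every substring that matches the answer's tag pattern."""
    return TAG_RE.sub("", text)

if __name__ == "__main__":
    # The ">" inside the quoted attribute value does not end the tag, and the
    # lone "<" followed by a space is not treated as a tag.
    sample = '<p class="a>b">Hello</p> 1 < 2'
    print(strip_tags(sample))  # Hello 1 < 2
```

Note that text such as `x<y && y>z` still looks like a tag to this pattern and comes out as `xz`. The nested lazy quantifiers may also backtrack heavily on long inputs that contain `<name` with no closing `>`, so it is worth timing on representative documents before relying on it.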

Taking a list of all allowed tags and using it as the first part of the regex, like <(tag1|tag2|...), could lead to a more precise solution. I'm afraid an exact solution can't be found given your assumption about stray angle brackets; think, for example, of something like <a href="test.html"> b<a </a>...
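A minimal sketch of the tag-list idea follows, again assuming Python's `re`. The tag list `KNOWN_TAGS`, the function name `strip_known_tags`, the optional leading `/`, the `\b` word boundary, and case-insensitive matching are all assumptions added for the example; only the `<(tag1|tag2|...)` idea and the remainder of the pattern come from the answer.

```python
import re

# Hypothetical allow-list; a real one would cover every tag you expect to see.
KNOWN_TAGS = ("a", "b", "br", "div", "i", "p", "span")

# "</?" also accepts closing tags, and "\b" stops "<a" from matching "<article".
# Both are additions for this sketch. The tail is the answer's pattern unchanged.
TAG_RE = re.compile(
    r"</?(?:" + "|".join(map(re.escape, KNOWN_TAGS)) + r")\b"
    + r"""((".*?")|([^<"']+?)|('.*?'))*?>""",
    re.IGNORECASE,
)

def strip_known_tags(text: str) -> str:
    """Remove only tags whose name appears in KNOWN_TAGS."""
    return TAG_RE.sub("", text)

if __name__ == "__main__":
    # "<y && y>" is no longer mistaken for a tag because "y" is not listed.
    print(strip_known_tags("<p>x<y && y>z</p>"))  # x<y && y>z
```

The trade-off is that any stray text starting with a listed name, such as `<b && c>`, is still removed, which is the kind of ambiguity the answer says cannot be resolved exactly.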

Edit

I updated the regex (it performs a lot better than the previous one). Moreover, if you need to strip out code, I suggest doing a little cleanup before the first pass, such as replacing <script.+?</script> with nothing.
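The script-removal step might look like the sketch below, assuming Python's `re`. The function name `drop_scripts` is hypothetical, and the `DOTALL` and `IGNORECASE` flags are assumptions added so that multi-line and upper-case script blocks are covered; the answer gives only the bare pattern.

```python
import re

# The answer's cleanup pattern. DOTALL lets "." cross line breaks and
# IGNORECASE accepts "<SCRIPT>"; both flags are additions for this sketch.
SCRIPT_RE = re.compile(r"<script.+?</script>", re.IGNORECASE | re.DOTALL)

def drop_scripts(text: str) -> str:
    """Delete script elements, including their contents, before tag stripping."""
    return SCRIPT_RE.sub("", text)

if __name__ == "__main__":
    sample = "before<script>\nif (x < 1) { run(); }\n</script>after"
    print(drop_scripts(sample))  # beforeafter
```

Running this first keeps JavaScript source, which often contains `<` and `>` as operators, from reaching the tag-stripping pattern at all.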
