How to extract text from reasonably sane HTML?

Views: 256 Published: 2016/9/18 10:53:04 Tags: c# html d text-extraction

My question is sort of like this question but I have more constraints:

I know the documents are reasonably sane
they are very regular (they all came from the same source)
I want about 99% of the visible text
about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?

I'm open to command line or batch processing tools as well as C/C#/D libraries.
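
To make the constraints above concrete, here is a minimal sketch of a batch-friendly extraction step. It is written in Python using only the standard library's html.parser module, as one possible command-line route rather than a recommended tool. The names VisibleTextExtractor, SKIPPED_TAGS and extract_text are made up for this illustration, and the set of skipped tags is an assumption about what counts as "not visible" in RTF-derived HTML.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and ignores markup (hypothetical helper)."""

    # Assumption: content inside these elements is never visible text.
    SKIPPED_TAGS = {"script", "style", "head", "title"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Formatting and paragraph breaks are not needed, so all
        # whitespace runs are collapsed to single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling extract_text on the contents of each file would give one line of text per document. Because chunks are joined with a space, a word split by an inline tag (for example half of it in bold) comes out as two words, which seems acceptable for a 99% target but is worth knowing about.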