How to extract text from reasonably sane HTML?
Question
My question is sort of like this question but I have more constraints:
- I know the documents are reasonably sane
- they are very regular (they all came from the same source)
- I want about 99% of the visible text
- about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
- I don't care about formatting or even paragraph breaks.
Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?
I'm open to command line or batch processing tools as well as C/C#/D libraries.
Recommended answer
You need to use the HTML Agility Pack.
You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
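A minimal sketch of that approach in C#, assuming the HtmlAgilityPack NuGet package is referenced. The class name TextExtractor and the method ExtractText are hypothetical names chosen for illustration. Starting from the body element, stripping script, style, and comment nodes, and collapsing whitespace are also assumptions about what "visible text" should mean here, not something the answer specifies.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TextExtractor
{
    // Hypothetical helper: returns the text under <body> with whitespace collapsed.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: script, style, and comment nodes are not "visible text",
        // so drop them first. ToList() snapshots the sequence so that nodes can
        // be removed without modifying the tree while it is being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // Find the element with LINQ over Descendants; fall back to the whole
        // document if the input has no <body> element.
        HtmlNode root = doc.DocumentNode.Descendants("body").FirstOrDefault()
                        ?? doc.DocumentNode;

        // InnerText concatenates all text below the node; entities such as
        // &amp; or &nbsp; are still encoded, so decode them afterwards.
        string text = HtmlEntity.DeEntitize(root.InnerText);

        // Formatting and paragraph breaks do not matter here, so collapse
        // every run of whitespace into a single space.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        string html = "<html><body><script>var x = 1;</script>"
                    + "<p>Fish &amp; chips</p> <p>Second paragraph.</p></body></html>";
        Console.WriteLine(ExtractText(html));
    }
}
```

Note that InnerText inserts nothing between elements, so two adjacent blocks with no whitespace between them in the source will run together. If that shows up in the converted documents, collecting the individual text nodes and joining them with a space is one possible alternative.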