How can I convert HTML to text in C#?
Question
I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but for something that will output plain text while reasonably preserving the original layout.
The output should look like this:
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what the Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
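A minimal sketch of such a wrapper class might look like the following. It is not the asker's actual code: the class name LynxHtmlToText, the method names, and the parameters lynxPath and htmlPathOrUrl are hypothetical, and it assumes a Lynx executable is installed at the path you pass in. Only the "-dump" switch and the two ProcessStartInfo settings come from the question itself.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around "lynx -dump"; names are illustrative only.
public static class LynxHtmlToText
{
    // Builds the process settings described in the question.
    // lynxPath and htmlPathOrUrl are supplied by the caller (assumptions).
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlPathOrUrl)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // Naive quoting: assumes the path or URL contains no double quotes.
            Arguments = "-dump \"" + htmlPathOrUrl + "\"",
            UseShellExecute = false,        // required in order to redirect streams
            RedirectStandardOutput = true,  // lets us read the rendered text
            CreateNoWindow = true           // avoid flashing a console window
        };
    }

    // Runs Lynx and returns whatever it wrote to standard output.
    public static string Convert(string lynxPath, string htmlPathOrUrl)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlPathOrUrl)))
        {
            // Read the output before waiting for exit, so a full pipe
            // buffer cannot block the child process indefinitely.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "lynx exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

Usage would be along the lines of `string text = LynxHtmlToText.Convert(@"C:\lynx\lynx.exe", "page.html");`, where both arguments are placeholders. Splitting out BuildStartInfo keeps the process configuration inspectable without having to launch Lynx.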
Recommended answer
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.