How can I Convert HTML to Text in C#?


Question

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt @ W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
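For anyone who wants to try the same route, here is a rough sketch of what such a wrapper class might look like. It is not the code I ended up with, just an illustration under a few assumptions: the class name LynxTextConverter and its two methods are made up for this example, the Lynx executable is assumed to be reachable on the PATH as "lynx" (pass a full path to lynx.exe otherwise), and the input is a local file path or URL without embedded double quotes.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper; the class and method names are illustrative only.
public static class LynxTextConverter
{
    // Builds the process settings described above: no shell, stdout redirected.
    // lynxPath defaults to "lynx", which assumes the executable is on the PATH.
    public static ProcessStartInfo BuildStartInfo(string htmlPathOrUrl, string lynxPath = "lynx")
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // "-dump" makes Lynx write the rendered page to standard output and exit.
            // Simple quoting only; input containing double quotes is not handled.
            Arguments = "-dump \"" + htmlPathOrUrl + "\"",
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,  // lets us read the rendered text
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns whatever it printed to standard output.
    public static string Convert(string htmlPathOrUrl, string lynxPath = "lynx")
    {
        using (Process process = Process.Start(BuildStartInfo(htmlPathOrUrl, lynxPath)))
        {
            // Read the output before waiting, so a full pipe buffer cannot deadlock us.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "lynx exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

Usage would be something like `string text = LynxTextConverter.Convert(@"C:\temp\page.html");`. Splitting out BuildStartInfo is only a convenience so the settings can be inspected without actually launching Lynx.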

Recommended answer

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.
