How can I Convert HTML to Text in C#?

Question

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that outputs plain text while reasonably preserving the original layout.

The output should look like this:

Html2Txt @ W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output looked nothing like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see whether a more "canned" solution is available.

EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout by setting ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
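A minimal sketch of what such a wrapper class might look like, not the asker's actual code. The class name `LynxTextDumper`, its method names, and the `lynxPath` and `source` parameters are hypothetical. It assumes Lynx is installed and that the runtime is .NET Core 2.1 or later (for `ProcessStartInfo.ArgumentList`); on .NET Framework the `Arguments` string property would be needed instead.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: runs "lynx -dump <source>" and returns the rendered text.
public static class LynxTextDumper
{
    // Builds the start info described above: no shell, stdout redirected.
    public static ProcessStartInfo CreateStartInfo(string lynxPath, string source)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = lynxPath,            // e.g. a full path to lynx.exe (assumption)
            UseShellExecute = false,        // required in order to redirect streams
            RedirectStandardOutput = true,  // lets us read what lynx prints
            CreateNoWindow = true           // avoid flashing a console window
        };
        // ArgumentList handles quoting of each argument for us.
        startInfo.ArgumentList.Add("-dump");
        startInfo.ArgumentList.Add(source); // URL or local HTML file
        return startInfo;
    }

    // Starts lynx, captures its standard output, and returns it as a string.
    public static string Dump(string lynxPath, string source)
    {
        using (var process = Process.Start(CreateStartInfo(lynxPath, source)))
        {
            if (process == null)
            {
                throw new InvalidOperationException("Could not start lynx.");
            }

            // Read all output before waiting, so a full pipe buffer
            // cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "lynx exited with code " + process.ExitCode + ".");
            }
            return text;
        }
    }
}
```

A call might then look like `string text = LynxTextDumper.Dump(@"C:\tools\lynx\lynx.exe", "page.html");`, where both the path and the file name are placeholders. Spawning a process per conversion is the trade-off the edit above already accepts for code that runs only occasionally.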

Recommended answer

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.
