C# HTML to Word or text


Question

Hi, everyone. I am writing a blog (just getting started). I want to convert a URL to Word or text. Let's say http://www.yahoo.com/. I want to extract its content to Word or text, so that I can add my own content to it or edit its formatting. How could I do this?

Solution

Basically, you can convert it to text by removing all tags from the HTML code, which is a simple string-manipulation routine. Another problem is that you need to convert all HTML character entities to Unicode characters. For this, an HTML parser can be very handy: it gives you the values of all text nodes, with the character entities already resolved.
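The string-manipulation route can be sketched roughly as follows. This is a minimal illustration, not a robust converter: the class name `HtmlTextSketch` and the method `StripTags` are made up for this example, and the regular expressions assume reasonably simple markup. `WebUtility.HtmlDecode` from `System.Net` handles the entity conversion.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the original answer.
public static class HtmlTextSketch
{
    // Naive tag stripping: adequate for simple markup, but it does not understand
    // comments, CDATA sections, or a '>' inside an attribute value.
    public static string StripTags(string html)
    {
        // Drop <script> and <style> elements together with their content.
        string noScripts = Regex.Replace(
            html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words do not merge.
        string noTags = Regex.Replace(noScripts, @"<[^>]+>", " ");

        // Convert character entities such as &amp; or &#169; to Unicode characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse the runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For instance, `HtmlTextSketch.StripTags("<p>Fish &amp; chips</p>")` returns `Fish & chips`. For real-world pages, a proper parser is likely to be more reliable than regular expressions.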

The simplest way of doing it would be to use an XML parser, but that can only work if your HTML is well-formed XML. XML parsers are readily available in .NET. It is really a shame, but much of the HTML found in the real world is not well-formed. In that case, you need a parser that can work with non-well-formed code; look at one of the HTML parsers available for C#. The following projects can be found on CodeProject:
An Elementary HTML Parser
Another C# Legacy HTML Parser Using Tag Processing
AfterWork HTML Parser in C#
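For the well-formed case, the XML-parser route might look like the sketch below, which uses `XDocument` from `System.Xml.Linq`. The class and method names are hypothetical. Note the assumption that the input really is well-formed XML: unclosed tags cause an `XmlException`, and so do HTML-only named entities such as `&nbsp;` when no DTD defines them.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; the names are not from the original answer.
public static class XhtmlTextSketch
{
    public static string ExtractText(string xhtml)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed XML.
        XDocument doc = XDocument.Parse(xhtml);

        // Collect every non-empty text node; the parser has already resolved
        // the entities XML itself knows about (&amp;, &lt;, &#169;, ...).
        var parts = doc.DescendantNodes()
                       .OfType<XText>()
                       .Select(t => t.Value.Trim())
                       .Where(s => s.Length > 0);

        return string.Join(" ", parts);
    }
}
```

Ordinary HTML such as `<p>one<br>two</p>` fails here because `<br>` is never closed, which is exactly why a dedicated HTML parser is needed for most pages.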

If this is not enough, do your own search:
http://en.lmgtfy.com/?q=HTML+parser+%22C%23%22

Sharpen your Google skills, by the way; it will greatly help you.

Now, what kind of Word document do you want: with formatting close to the HTML source or not?

In the first case, you can just open the HTML document with Word (and save it as a .doc or .docx file); in the second case, you do not need to do anything else: you can consider plain Unicode text a special case of a Word document.

If you need to do the conversion automatically (though I don't know why, since an HTML document can be opened with Word anyway), you would need to use Office/Word interop.

To create a Word document, use the Office interop assembly. Basically, under your project's "References" node in the Solution Explorer, click "Add Reference", go to the "COM" tab of the "Add Reference" window, and add a reference to the Microsoft Word Object Library of the required version. Please see:
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28v=office.11%29.aspx,
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28v=office.14%29.aspx.

(Or the corresponding documentation for the version you need.)

See also:
http://msdn.microsoft.com/en-us/library/aa192495%28v=office.11%29.aspx,
http://msdn.microsoft.com/en-us/office/hh128772.aspx.
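An automated conversion through Word interop could be sketched roughly like this. It is only an outline under several assumptions: Microsoft Word is installed on the machine, the project references the Microsoft Word Object Library as described above, the compiler is C# 4 or later (so optional COM arguments can be omitted), and the Word version is 2010 or later, where `SaveAs2` is available (older versions expose `SaveAs` instead). The class and method names are hypothetical.

```csharp
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper for illustration; requires an installed copy of Word.
public static class HtmlToWordSketch
{
    // Opens an HTML file in a hidden Word instance and saves it as .docx.
    // Absolute paths are safest, because Word resolves relative paths
    // against its own working directory.
    public static void Convert(string htmlPath, string docxPath)
    {
        Word.Application app = new Word.Application();
        app.Visible = false;
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(FileName: htmlPath, ReadOnly: true);

            // wdFormatXMLDocument produces .docx; wdFormatDocument produces .doc.
            doc.SaveAs2(FileName: docxPath,
                        FileFormat: Word.WdSaveFormat.wdFormatXMLDocument);
        }
        finally
        {
            // Always shut Word down, otherwise a hidden WINWORD.EXE keeps running.
            if (doc != null) doc.Close(SaveChanges: false);
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
    }
}
```

A call such as `HtmlToWordSketch.Convert(@"C:\temp\page.html", @"C:\temp\page.docx")` (both paths are placeholders) would then perform the same steps as opening the file in Word and choosing "Save As" by hand.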


—SA

