Saving multiple Word documents as HTML through Office API


Question

I have a large number of Word documents that I need to parse. As they were all created from the same template, I think the best approach would be to save them as HTML files and parse the HTML itself.

While it's quite easy to save a single Word document as HTML, I haven't found a way to do it in bulk from inside Word. So I'm trying to find a way to use the Microsoft Office/Word API to accomplish this.

How can I use the Word API to save many Word documents as HTML?

Thanks in advance.

Update: a few more details...

Some of the documents have the extension .doc, while others are .docx. I hope this isn't a problem, but if it is, I'll just have to convert them all to .docx, hopefully with the API or with DocX.

Speaking of DocX, I saw on the author's blog that it's possible to save a .docx file as HTML with the following code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Word = Microsoft.Office.Interop.Word;
using Microsoft.Office.Interop.Word;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Convert Input.docx into Output.doc
            Convert(@"C:\users\cathal\Desktop\Input.docx", @"c:\users\cathal\Desktop\Output.doc", WdSaveFormat.wdFormatDocument);

            /*
             * Convert Input.docx into Output.pdf
             * Please note: You must have the Microsoft Office 2007 Add-in: Microsoft Save as PDF or XPS installed
             * http://www.microsoft.com/downloads/details.aspx?FamilyId=4D951911-3E7E-4AE6-B059-A2E79ED87041&displaylang=en
             */
            Convert(@"c:\users\cathal\Desktop\Input.docx", @"c:\users\cathal\Desktop\Output.pdf", WdSaveFormat.wdFormatPDF);

            // Convert Input.docx into Output.html
            Convert(@"c:\users\cathal\Desktop\Input.docx", @"c:\users\cathal\Desktop\Output.html", WdSaveFormat.wdFormatHTML);
        }

        // Convert the input document to the requested format (.doc, .pdf, .html, ...)
        public static void Convert(string input, string output, WdSaveFormat format)
        {
            // Create an instance of Word.exe
            Word._Application oWord = new Word.Application();

            // Make this instance of word invisible (Can still see it in the taskmgr).
            oWord.Visible = false;

            // Interop requires objects.
            object oMissing = System.Reflection.Missing.Value;
            object isVisible = true;
            object readOnly = false;
            object oInput = input;
            object oOutput = output;
            object oFormat = format;

            // Load a document into our instance of word.exe
            Word._Document oDoc = oWord.Documents.Open(ref oInput, ref oMissing, ref readOnly, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref isVisible, ref oMissing, ref oMissing, ref oMissing, ref oMissing);

            // Make this document the active document.
            oDoc.Activate();

            // Save this document in the requested format.
            oDoc.SaveAs(ref oOutput, ref oFormat, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);

            // Always close Word.exe.
            oWord.Quit(ref oMissing, ref oMissing, ref oMissing);
        }
    }
}

Is this the best way to do it?

Recommended answer

The code you posted above should do the job. Also, as far as I know, the Document.SaveAs API can convert any document that Word can open (.docx, .doc, .rtf) to HTML (or to any other format Word supports).

Also, instead of creating a Word application instance for each file, pass a string[] of file names to the Convert method, and close each document instance once its SaveAs call has finished.
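The suggestion above could look roughly like the following. This is only a sketch, not tested against any particular Office version: the ConvertAll method, the BatchConverter class, and the C:\docs and C:\docs\html folders are hypothetical names chosen for illustration, and it assumes Word and the Microsoft.Office.Interop.Word assembly are available. It opens one Word instance, converts every file in a loop, closes each document after saving, and quits Word once at the end.

```csharp
using System;
using System.IO;
using Word = Microsoft.Office.Interop.Word;

class BatchConverter
{
    // Hypothetical helper: saves every input file as HTML using a single Word instance.
    public static void ConvertAll(string[] inputs, string outputDir)
    {
        object missing = System.Reflection.Missing.Value;
        object readOnly = true;      // the source documents are never modified
        object isVisible = false;
        object format = Word.WdSaveFormat.wdFormatHTML;
        object doNotSave = Word.WdSaveOptions.wdDoNotSaveChanges;

        // One Word.exe for the whole batch instead of one per file.
        Word._Application word = new Word.Application();
        word.Visible = false;
        try
        {
            foreach (string input in inputs)
            {
                // Skip the "~$..." lock files Word creates next to open documents.
                if (Path.GetFileName(input).StartsWith("~$")) continue;

                object inPath = Path.GetFullPath(input);
                object outPath = Path.Combine(outputDir,
                    Path.GetFileNameWithoutExtension(input) + ".html");

                Word._Document doc = word.Documents.Open(ref inPath, ref missing, ref readOnly,
                    ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
                    ref missing, ref missing, ref isVisible, ref missing, ref missing,
                    ref missing, ref missing);
                try
                {
                    doc.SaveAs(ref outPath, ref format, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing, ref missing, ref missing,
                        ref missing);
                }
                finally
                {
                    // Release each document as soon as it has been saved.
                    doc.Close(ref doNotSave, ref missing, ref missing);
                }
            }
        }
        finally
        {
            // Quit Word once, even if one of the conversions throws.
            word.Quit(ref missing, ref missing, ref missing);
        }
    }

    static void Main(string[] args)
    {
        // Hypothetical folders; "*.doc*" is meant to pick up both .doc and .docx files.
        string outputDir = @"C:\docs\html";
        Directory.CreateDirectory(outputDir);
        ConvertAll(Directory.GetFiles(@"C:\docs", "*.doc*"), outputDir);
    }
}
```

Note that in this sketch Report.doc and Report.docx would both be written to Report.html, so the output naming would need adjusting if such pairs exist. If the goal is only to parse the result, WdSaveFormat.wdFormatFilteredHTML may be worth trying in place of wdFormatHTML, since it omits much of the Office-specific markup.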
