Batch conversion of docx to clean HTML


Problem description


I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.

I think it would help to explain what this entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy them over to Dreamweaver, fix some formatting, and put the result onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert them.
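The batch step described above can be sketched roughly as follows. This is a minimal Python illustration rather than the C# application from the question; `wrap_page`, `batch_convert`, and the `convert` callable are hypothetical names, and `convert` stands in for whatever docx-to-HTML routine is eventually used.

```python
from pathlib import Path


def wrap_page(body_html, header_html, footer_html):
    """Place the converted body between the site's header and footer markup."""
    return f"{header_html}\n{body_html}\n{footer_html}\n"


def batch_convert(docx_paths, convert, header_html, footer_html, out_dir):
    """Convert each docx path with `convert` and write one .html file per input.

    `convert` is an assumed callable: it takes a path and returns an HTML
    fragment as a string. Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in map(Path, docx_paths):
        page = wrap_page(convert(path), header_html, footer_html)
        target = out_dir / (path.stem + ".html")
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as a parameter keeps the loop independent of how the conversion itself is done, so the same wrapper works whether the HTML comes from an XSLT transform or from a library.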

I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.

http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190

This is probably fine for a few documents, but since it just automates an instance of Word, I feel like it'd be slow and memory-intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.

http://openxmldeveloper.org/articles/333.aspx

This is what I started using. XSLT has the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs document.xml from it, and applies the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that stylesheet was originally provided by MS for SharePoint servers to render Word documents in a browser, or something along those lines.
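The "decompress and grab document.xml" step can be illustrated with a short sketch. It is written in Python with only the standard library, as an assumption-laden illustration and not the code from the question: a .docx is a ZIP archive whose main body lives in the `word/document.xml` part, and the function names below are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_document_xml(docx_source):
    """Return the parsed root of word/document.xml (accepts a path or file object)."""
    with zipfile.ZipFile(docx_source) as archive:
        with archive.open("word/document.xml") as part:
            return ET.parse(part).getroot()


def paragraph_texts(root):
    """Yield the plain text of each w:p element by joining its w:t text nodes."""
    for para in root.iter(f"{{{W_NS}}}p"):
        yield "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
```

No instance of Word is involved at any point, which is the property that made the XSLT route attractive in the first place.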

After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).

Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue is that I don't know how to write one. Perhaps there's an alternative version already out there. All I'd need is one that preserves tables and text formatting. Images aren't needed.
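To make the stated requirement concrete (keep tables and basic text formatting, drop everything else), here is a minimal sketch in Python using only the standard library. It is not the MS stylesheet or any existing tool; all function names are hypothetical, and it deliberately handles only paragraphs, bold, italic, and tables. It also assumes that the presence of `w:b` or `w:i` means the run is bold or italic, ignoring the `w:val="0"` case and style inheritance.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# Clark-notation prefix for the WordprocessingML main namespace.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def run_to_html(run):
    """Render one w:r run as escaped text, wrapped in <em>/<strong> if flagged."""
    text = html.escape("".join(t.text or "" for t in run.iter(W + "t")))
    props = run.find(W + "rPr")
    if text and props is not None:
        # Assumption: presence of w:i / w:b means on (w:val="0" is not checked).
        if props.find(W + "i") is not None:
            text = f"<em>{text}</em>"
        if props.find(W + "b") is not None:
            text = f"<strong>{text}</strong>"
    return text


def paragraph_to_html(para):
    """Render a w:p paragraph as a single <p> element."""
    return "<p>" + "".join(run_to_html(r) for r in para.iter(W + "r")) + "</p>"


def table_to_html(tbl):
    """Render a w:tbl table as plain <table>/<tr>/<td> markup."""
    rows = []
    for tr in tbl.findall(W + "tr"):
        cells = [
            "<td>" + "".join(block_to_html(b) for b in tc) + "</td>"
            for tc in tr.findall(W + "tc")
        ]
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def block_to_html(elem):
    """Dispatch on block type; anything else (properties, images) is skipped."""
    if elem.tag == W + "p":
        return paragraph_to_html(elem)
    if elem.tag == W + "tbl":
        return table_to_html(elem)
    return ""


def docx_to_clean_html(docx_source):
    """Return an HTML fragment for the body of a .docx (path or file object)."""
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    return "\n".join(filter(None, (block_to_html(b) for b in body)))
```

Because the output contains only the elements emitted here, it carries none of the extra tags and inline styling a general-purpose stylesheet produces; the trade-off is that lists, headings, hyperlinks, and merged cells would each need their own handling.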

Solution

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx

The author, Eric White, blogged about his experiences developing that tool. You can see the list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
