Is it possible to parse MS Word using Apache POI and convert it into XML?

Question

Is it possible to convert an MS Word document to an XML file using Apache POI?

If so, can you point me to any tutorials for doing that?

Recommended answer

I'd say you have two options, both powered by Apache POI.

One is to use Apache Tika. Tika is a text and metadata extraction toolkit, and it can extract fairly rich text from Word documents by making the appropriate calls to POI. The result is that Tika gives you XHTML-style XML for the contents of your Word document.
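
As a rough sketch of the Tika route, the following assumes the tika-core and tika-parsers libraries are on the classpath. The class name WordToXhtml and the method toXhtml are made up for illustration; AutoDetectParser and ToXMLContentHandler are Tika classes.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical helper class, not part of Tika itself.
public class WordToXhtml {

    // Parses the stream with Tika and returns the SAX events it emits as an XHTML string.
    public static String toXhtml(InputStream input) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();          // picks a parser from the detected type
        ToXMLContentHandler handler = new ToXMLContentHandler();   // collects the XHTML output as text
        Metadata metadata = new Metadata();                        // filled with document metadata
        parser.parse(input, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a Word file.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(toXhtml(in));
        }
    }
}
```

Because the parser is chosen by type detection, the same method should work for other formats Tika supports, provided the matching parser modules are available.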

The other option is to use WordToHtmlConverter, a class that was added to POI fairly recently. It turns your Word document into HTML, and it generally preserves slightly more of the structure and formatting than Tika does.
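
A minimal sketch of the WordToHtmlConverter route might look like the following. WordToHtmlConverter belongs to POI's HWPF module (the poi-scratchpad artifact), so it reads the binary .doc format rather than .docx. The class name WordDocToHtml and its two methods are hypothetical names chosen for this example, and the output is serialized as XML here only because the question asks for XML.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of POI itself.
public class WordDocToHtml {

    // Reads a binary .doc stream and returns the converted markup as a string.
    public static String convert(InputStream input) throws Exception {
        HWPFDocument word = new HWPFDocument(input);
        // The converter writes its result into an empty DOM document that we supply.
        Document target = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(target);
        converter.processDocument(word);
        return serialize(converter.getDocument());
    }

    // Serializes a DOM document to text using the JDK's built-in transformer.
    public static String serialize(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml"); // use "html" for plain HTML output
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(convert(in));
        }
    }
}
```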

Depending on the kind of XML you're hoping to get out, one of these should be a good bet for you. I'd suggest you try both against some of your sample files, and see which one is the best fit for your problem domain and needs.
