Is it possible to parse MS Word using Apache POI and convert it into XML?

Question

Is it possible to convert an MS Word document to an XML file using Apache POI?

If so, can you point me to any tutorials for doing that?

Recommended answer

I'd say you have two options, both powered by Apache POI.

One is to use Apache Tika. Tika is a text and metadata extraction toolkit, and it can extract fairly rich text from Word documents by making the appropriate calls to POI. The result is that Tika gives you XHTML-style XML for the contents of your Word document.
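
As a rough sketch of the Tika route, something like the following should work, assuming tika-core and the Tika parser modules (which pull in POI) are on the classpath. The class name `TikaWordToXhtml` and the method `toXhtml` are made up for illustration; `AutoDetectParser` and `ToXMLContentHandler` are Tika's own classes.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical helper class, not part of Tika or POI.
public class TikaWordToXhtml {

    // Parses the given file and returns Tika's XHTML output as a string.
    public static String toXhtml(Path wordFile) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();        // detects the file type and picks a parser
        ToXMLContentHandler handler = new ToXMLContentHandler(); // collects the SAX events as XML text
        Metadata metadata = new Metadata();                      // filled with document metadata during parsing
        try (InputStream in = Files.newInputStream(wordFile)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaWordToXhtml <word-file>");
            return;
        }
        System.out.println(toXhtml(Paths.get(args[0])));
    }
}
```

The returned string is the XHTML document Tika builds from the parse, so it can be fed to any XML tooling from there.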

The other option is to use a class that was added fairly recently to POI, WordToHtmlConverter. This turns your Word document into HTML for you, and it will generally preserve slightly more of the structure and formatting than Tika will.
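
A minimal sketch of the WordToHtmlConverter route might look like this. It assumes the poi and poi-scratchpad jars are on the classpath and that the input is a binary .doc file, since this converter works on POI's HWPF model. The class name `DocToHtml` and the method `toHtml` are hypothetical.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of POI.
public class DocToHtml {

    // Converts a binary .doc stream to an HTML string.
    public static String toHtml(InputStream docStream) throws Exception {
        // The converter builds its output into an empty DOM document.
        Document htmlDom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(htmlDom);
        converter.processDocument(new HWPFDocument(docStream));

        // Serialize the resulting DOM tree to text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(converter.getDocument()), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: DocToHtml <file.doc>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(toHtml(in));
        }
    }
}
```

If you need well-formed XML rather than HTML, setting the output method to "xml" on the serializer is one thing to try, since the converter's result is an ordinary DOM tree.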

Depending on the kind of XML you're hoping to get out, one of these should be a good bet for you. I'd suggest you try both against some of your sample files and see which one is the best fit for your problem domain and needs.
