How to extract formatting information from a Word document using Apache POI?


Question

I am using Apache POI to extract formatting information from MS Word files.

I want to extract information such as whether a paragraph has a bullet, and its background color, foreground color, alignment, etc.

There is not much documentation, and there are few tutorials, available for this. The Javadoc does not contain much helpful information either.

Where can I find tutorials or good documentation to help me learn the Apache POI API?

Answer

For HWPF (.doc), the classes you probably want are:

Depending on the exact property you want, it may be on the paragraph or the character properties.
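
As a rough illustration of that paragraph/character split, here is a minimal sketch that walks a .doc file with HWPF and prints a few properties from each level. It assumes a recent POI release (4.x or 5.x) with the poi-scratchpad jar on the classpath; the class name `DocFormattingSketch` and the helper `justificationName` are made up for this example, and the mapping of justification codes to names is my reading of the `Paragraph.getJustification()` Javadoc, so treat it as an assumption.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class DocFormattingSketch {

    // Hypothetical helper: turns the raw justification code into a label.
    // Assumed mapping: 0=left, 1=center, 2=right, 3=justified.
    static String justificationName(int jc) {
        switch (jc) {
            case 0: return "left";
            case 1: return "center";
            case 2: return "right";
            case 3: return "justified";
            default: return "other(" + jc + ")";
        }
    }

    // Prints paragraph-level and character-level formatting for a .doc stream.
    static void dump(InputStream in) throws IOException {
        try (HWPFDocument doc = new HWPFDocument(in)) {
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);
                // Paragraph-level properties: alignment and list membership.
                System.out.println("paragraph " + i
                        + " alignment=" + justificationName(p.getJustification())
                        + " inList=" + p.isInList());
                for (int j = 0; j < p.numCharacterRuns(); j++) {
                    CharacterRun run = p.getCharacterRun(j);
                    // Character-level properties: font, size, weight, color index.
                    System.out.println("  run font=" + run.getFontName()
                            + " halfPoints=" + run.getFontSize()
                            + " bold=" + run.isBold()
                            + " italic=" + run.isItalic()
                            + " colorIndex=" + run.getColor());
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DocFormattingSketch <file.doc>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            dump(in);
        }
    }
}
```

Note that `getFontSize()` is reported in half-points and `getColor()` is a palette index rather than an RGB value. Background shading is not covered here.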

The best example I can think of for reading a Word document with HWPF and getting text, checking styles and formatting, etc. is WordExtractor from Apache Tika: https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

(XWPF for .docx is similar)
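
For comparison, a minimal XWPF sketch might look like the following. It assumes POI 4.x or 5.x with the poi-ooxml jar on the classpath. To stay self-contained it first builds a one-paragraph .docx in memory and then reads the formatting back; the class name `DocxFormattingSketch` and its two helper methods are invented for this example. Treating a non-null `getNumID()` as "the paragraph belongs to a bulleted or numbered list" is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.ParagraphAlignment;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class DocxFormattingSketch {

    // Builds a tiny .docx in memory so the example needs no input file.
    static byte[] buildSample() throws IOException {
        try (XWPFDocument doc = new XWPFDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            XWPFParagraph p = doc.createParagraph();
            p.setAlignment(ParagraphAlignment.CENTER);
            XWPFRun r = p.createRun();
            r.setText("Centered, bold, red");
            r.setBold(true);
            r.setColor("FF0000"); // hex RGB, no leading '#'
            doc.write(out);
            return out.toByteArray();
        }
    }

    // Reads paragraph-level and run-level formatting back out of a .docx.
    static List<String> describe(byte[] docx) throws IOException {
        List<String> lines = new ArrayList<>();
        try (XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx))) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                // A numbering ID is present when the paragraph is part of a list.
                boolean inList = p.getNumID() != null;
                lines.add("paragraph alignment=" + p.getAlignment() + " inList=" + inList);
                for (XWPFRun r : p.getRuns()) {
                    // getColor() returns null when no explicit color is set.
                    lines.add("  run text=\"" + r.text() + "\""
                            + " bold=" + r.isBold()
                            + " italic=" + r.isItalic()
                            + " color=" + r.getColor());
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        describe(buildSample()).forEach(System.out::println);
    }
}
```

To inspect a real file, pass its bytes to `describe` (for example via `java.nio.file.Files.readAllBytes`) instead of calling `buildSample()`.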
