How to extract formatting information from a Word document using Apache POI?
Problem description
I am using Apache POI to extract formatting information from MS Word files.
I want to extract information such as whether a paragraph has a bullet, its background color, foreground color, alignment, and so on.
There is not much documentation or many tutorials available for this, and the Javadoc does not contain much helpful information either.
Where can I find tutorials or good documentation that would help me learn the Apache POI API?
For HWPF (.doc), the classes you probably want are:
- http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/ParagraphProperties.html
- http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/CharacterProperties.html
- http://poi.apache.org/apidocs/org/apache/poi/hwpf/model/StyleDescription.html
Depending on the exact property you want, it may be on the paragraph or the character properties.
The best example I can think of for reading a Word document with HWPF, getting the text, and checking styles and formatting is WordExtractor from Apache Tika: https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
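As a rough illustration of how the paragraph and character properties fit together, here is a minimal sketch that walks a .doc file and prints a few formatting attributes. It assumes a reasonably recent POI release with the poi-scratchpad module on the classpath (so that HWPFDocument can be used in try-with-resources). The class name HwpfFormatDump and the helper alignmentName are made up for this example, and the mapping of justification codes to names is my understanding of the Word binary format rather than something POI documents, so check it against your own files.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Hypothetical example class: dumps paragraph and run formatting of a .doc file.
public class HwpfFormatDump {

    // Assumed mapping of the integer justification code to a readable name.
    static String alignmentName(int jc) {
        switch (jc) {
            case 0: return "left";
            case 1: return "center";
            case 2: return "right";
            case 3: return "justified";
            default: return "other(" + jc + ")";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: HwpfFormatDump <file.doc>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0]);
             HWPFDocument doc = new HWPFDocument(in)) {
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);

                // Paragraph-level information: named style, list membership, alignment.
                StyleDescription style =
                        doc.getStyleSheet().getStyleDescription(p.getStyleIndex());
                System.out.printf("paragraph %d: style=%s, inList=%b, alignment=%s%n",
                        i,
                        style == null ? "?" : style.getName(),
                        p.isInList(),
                        alignmentName(p.getJustification()));

                // Character-level information lives on the runs inside the paragraph.
                for (int j = 0; j < p.numCharacterRuns(); j++) {
                    CharacterRun run = p.getCharacterRun(j);
                    System.out.printf(
                            "  run %d: bold=%b, italic=%b, font=%s, size=%.1fpt, colorIndex=%d%n",
                            j,
                            run.isBold(),
                            run.isItalic(),
                            run.getFontName(),
                            run.getFontSize() / 2.0, // HWPF reports size in half-points
                            run.getColor());
                }
            }
        }
    }
}
```

Background shading is not shown here. Properties such as bullets and alignment sit on the paragraph, while bold, font, and text color sit on the character runs, which is why the sketch loops over both levels.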
(The XWPF classes for .docx files work in a similar way.)
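For comparison, a similar sketch for .docx files using XWPF might look like the following. It assumes the poi-ooxml module is on the classpath. The class name XwpfFormatDump and the helper listLabel are invented for this example. Note that getParagraphs() only returns body paragraphs, so text inside tables, headers, or footers would need separate handling.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical example class: dumps paragraph and run formatting of a .docx file.
public class XwpfFormatDump {

    // A paragraph with a numbering ID belongs to a bulleted or numbered list.
    static String listLabel(BigInteger numId) {
        return numId == null ? "not in a list" : "list numId=" + numId;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: XwpfFormatDump <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0]);
             XWPFDocument doc = new XWPFDocument(in)) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                // Paragraph-level information: style ID, list membership, alignment.
                System.out.printf("paragraph: styleId=%s, %s, alignment=%s%n",
                        p.getStyle(),
                        listLabel(p.getNumID()),
                        p.getAlignment());

                // Character-level information lives on the runs.
                for (XWPFRun run : p.getRuns()) {
                    System.out.printf(
                            "  run: bold=%b, italic=%b, font=%s, color=%s, text=%s%n",
                            run.isBold(),
                            run.isItalic(),
                            run.getFontFamily(), // null when inherited from the style
                            run.getColor(),      // hex RGB string, or null when unset
                            run.text());
                }
            }
        }
    }
}
```

Values that come back as null are usually inherited from the paragraph style or the document defaults, so a complete extractor would also have to resolve the style chain.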