Java Apache POI read Word (.doc) file and get named styles used
Question
I am trying to read a Microsoft Word 2003 Document (.doc) using poi-scratchpad-3.8 (HWPF). I need to either read the file word by word, or character by character. Either way is fine for what I need. Once I have read either a character or word, I need to get the style name that is applied to the word/character. So, the question is, how do I get the style name used for a word or character when reading the .doc file?
Edit
I am adding the code that I used to attempt this. If anyone wants to attempt this, good luck.
import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.model.StyleSheet;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

private void processDoc(String path) throws Exception {
    System.out.println(path);
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(path));
    HWPFDocument wdDoc = new HWPFDocument(fis);
    StyleSheet styles = wdDoc.getStyleSheet();

    // list all style names and indexes in the stylesheet
    for (int j = 0; j < styles.numStyles(); j++) {
        StyleDescription description = styles.getStyleDescription(j);
        // getStyleDescription can return null, so print null instead of failing
        System.out.println(j + ": " + (description != null ? description.getName() : null));
    }

    // set range for entire document
    Range range = wdDoc.getRange();

    // loop through all paragraphs in range
    for (int i = 0; i < range.numParagraphs(); i++) {
        Paragraph p = range.getParagraph(i);
        // only look up the style if its index lies inside the stylesheet
        if (styles.numStyles() > p.getStyleIndex()) {
            System.out.println(styles.numStyles() + " -> " + p.getStyleIndex());
            StyleDescription style = styles.getStyleDescription(p.getStyleIndex());
            String styleName = (style != null) ? style.getName() : null;
            // write style name and associated text
            System.out.println(styleName + " -> " + p.text());
        } else {
            // style index is out of range for this stylesheet
            System.out.println("\n" + styles.numStyles() + " ----> " + p.getStyleIndex());
        }
    }
}
Recommended answer
I would suggest that you take a look at the source code of WordExtractor from Apache Tika, as it's a great example of getting text and styling from a Word document using Apache POI.
Based on what you did and didn't say in your question, I suspect you're looking for something a little like this:
// "document" is an already opened HWPFDocument
Range r = document.getRange();
for (int i = 0; i < r.numParagraphs(); i++) {
    Paragraph p = r.getParagraph(i);
    // Range (and therefore Paragraph) exposes its text through text()
    String text = p.text();
    if (!text.contains("What I'm Looking For")) {
        // Try the next paragraph
        continue;
    }
    if (document.getStyleSheet().numStyles() > p.getStyleIndex()) {
        StyleDescription style =
            document.getStyleSheet().getStyleDescription(p.getStyleIndex());
        String styleName = style.getName();
        System.out.println(styleName + " -> " + text);
    } else {
        // Text has an unknown or invalid style
    }
}
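The listing in the question shows that getStyleDescription can return null for some indexes, so the lookup above may need a null check as well as the range check. A minimal sketch of that guard in plain Java follows. It assumes nothing from POI: the String[] is only a stand-in for the stylesheet, and StyleNames and nameFor are hypothetical names chosen for illustration.

```java
/**
 * Hypothetical helper, not part of Apache POI. The String[] stands in for
 * the document's stylesheet: slot i holds the name of style i, or null
 * where the stylesheet has no description for that index.
 */
final class StyleNames {
    private StyleNames() {}

    /**
     * Returns the style name stored at index, or fallback when the index
     * is outside the array or the slot is empty.
     */
    static String nameFor(String[] styleNames, int index, String fallback) {
        if (index < 0 || index >= styleNames.length) {
            return fallback; // index points outside the stylesheet
        }
        String name = styleNames[index];
        return (name != null) ? name : fallback; // empty slot
    }

    public static void main(String[] args) {
        String[] styles = {"Normal", "Heading 1", null, "Quote"};
        System.out.println(nameFor(styles, 1, "(unknown)")); // valid index
        System.out.println(nameFor(styles, 2, "(unknown)")); // empty slot
        System.out.println(nameFor(styles, 9, "(unknown)")); // out of range
    }
}
```

With a real document, the same two checks would wrap the numStyles() comparison and the getStyleDescription(...) call, so that a missing description yields a placeholder instead of a NullPointerException.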
For anything more advanced, take a look at the WordExtractor source code and see what else you can do with this sort of thing!
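The question asks for styles per word or character rather than per paragraph. In HWPF a paragraph is divided into character runs, which Range exposes through numCharacterRuns() and getCharacterRun(int), and each run's text() can then be split into words. The sketch below shows only the splitting step in plain Java so that it runs without POI: Run is a hypothetical stand-in for a character run whose style name has already been resolved, and WordStyles and wordsWithStyles are illustrative names, not POI API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordStyles {
    private WordStyles() {}

    /** Hypothetical stand-in for a character run: a stretch of text sharing one style. */
    static final class Run {
        final String text;
        final String styleName;

        Run(String text, String styleName) {
            this.text = text;
            this.styleName = styleName;
        }
    }

    /** Splits each run on whitespace and returns "word -> style" lines in document order. */
    static List<String> wordsWithStyles(List<Run> runs) {
        List<String> out = new ArrayList<>();
        for (Run run : runs) {
            // trim() also drops the trailing paragraph mark (\r) that Word text carries
            for (String word : run.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(word + " -> " + run.styleName);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Chapter One\r", "Heading 1"),
                new Run("Some body text. ", "Normal"));
        for (String line : wordsWithStyles(runs)) {
            System.out.println(line);
        }
    }
}
```

One limitation of this approach: a word whose formatting changes partway through spans two runs, so it is reported as two separate entries. Reading character by character works the same way, iterating over the characters of each run's text instead of splitting on whitespace.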