Java Apache POI read Word (.doc) file and get named styles used
Problem description
I am trying to read a Microsoft Word 2003 Document (.doc) using poi-scratchpad-3.8 (HWPF). I need to either read the file word by word, or character by character. Either way is fine for what I need. Once I have read either a character or word, I need to get the style name that is applied to the word/character. So, the question is, how do I get the style name used for a word or character when reading the .doc file?
EDIT
I am adding the code that I used to attempt this. If anyone wants to attempt this, good luck.
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

private void processDoc(String path) throws Exception {
    System.out.println(path);
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(path));
    HWPFDocument wdDoc = new HWPFDocument(fis);
    // list all style names and indexes in stylesheet
    for (int j = 0; j < wdDoc.getStyleSheet().numStyles(); j++) {
        if (wdDoc.getStyleSheet().getStyleDescription(j) != null) {
            System.out.println(j + ": " + wdDoc.getStyleSheet().getStyleDescription(j).getName());
        } else {
            // getStyleDescription returned null
            System.out.println(j + ": " + null);
        }
    }
    // set range for entire document
    Range range = wdDoc.getRange();
    // loop through all paragraphs in range
    for (int i = 0; i < range.numParagraphs(); i++) {
        Paragraph p = range.getParagraph(i);
        // only look up the style if its index is within the stylesheet
        if (wdDoc.getStyleSheet().numStyles() > p.getStyleIndex()) {
            System.out.println(wdDoc.getStyleSheet().numStyles() + " -> " + p.getStyleIndex());
            StyleDescription style = wdDoc.getStyleSheet().getStyleDescription(p.getStyleIndex());
            String styleName = style.getName();
            // write style name and associated text
            System.out.println(styleName + " -> " + p.text());
        } else {
            // style index is out of range for this stylesheet
            System.out.println("\n" + wdDoc.getStyleSheet().numStyles() + " ----> " + p.getStyleIndex());
        }
    }
}
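I would suggest that you take a look at the source code of WordExtractor from Apache Tika (http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java), as it's a great example of getting text and styling from a Word document using Apache POI.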
Based on what you did and didn't say in your question, I suspect you're looking for something a little like this:
// document is an already opened HWPFDocument
Range r = document.getRange();
for (int i = 0; i < r.numParagraphs(); i++) {
    Paragraph p = r.getParagraph(i);
    String text = p.text();
    if (!text.contains("What I'm Looking For")) {
        // Try the next paragraph
        continue;
    }
    if (document.getStyleSheet().numStyles() > p.getStyleIndex()) {
        StyleDescription style =
            document.getStyleSheet().getStyleDescription(p.getStyleIndex());
        String styleName = style.getName();
        System.out.println(styleName + " -> " + text);
    } else {
        // Text has an unknown or invalid style
    }
}
For anything more advanced, take a look at the WordExtractor source code and see what else you can do with this sort of thing!
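One thing worth guarding against: the question's first loop shows that getStyleDescription can return null for some indexes, yet the paragraph loops call style.getName() after checking only the range. Below is a minimal sketch of a combined range-and-null check. It deliberately does not use POI: the stylesheet is modelled as a plain List<String> so the snippet runs on its own, and the names StyleLookup and styleNameAt are made up for illustration. With a real HWPFDocument the same shape would apply around getStyleDescription(index).

```java
import java.util.Arrays;
import java.util.List;

class StyleLookup {

    /**
     * Returns the style name stored at styleIndex, or fallback when the index
     * is out of range or the slot holds no name.
     */
    static String styleNameAt(List<String> styleNames, int styleIndex, String fallback) {
        if (styleIndex < 0 || styleIndex >= styleNames.size()) {
            return fallback; // index points outside the style table
        }
        String name = styleNames.get(styleIndex);
        return name != null ? name : fallback; // empty slot in the table
    }

    public static void main(String[] args) {
        // Hypothetical style table; the null models an index with no description.
        List<String> styleNames = Arrays.asList("Normal", null, "Heading 1");
        int[] paragraphStyleIndexes = {0, 2, 1, 7};
        for (int index : paragraphStyleIndexes) {
            System.out.println(index + " -> " + styleNameAt(styleNames, index, "(unknown style)"));
        }
    }
}
```

Returning a fallback string instead of throwing keeps the paragraph loop going when a document contains a style slot that cannot be resolved, which seems to be what the empty else branch above is aiming for.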