Java Apache POI read Word (.doc) file and get named styles used

Views: 160 Published: 2016/5/22 13:35:14 Tags: java ms-word apache-poi


Question

I am trying to read a Microsoft Word 2003 Document (.doc) using poi-scratchpad-3.8 (HWPF). I need to either read the file word by word, or character by character. Either way is fine for what I need. Once I have read either a character or word, I need to get the style name that is applied to the word/character. So, the question is, how do I get the style name used for a word or character when reading the .doc file?

EDIT

I am adding the code that I used to attempt this. If anyone wants to attempt this, good luck.

import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.model.StyleSheet;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

private void processDoc(String path) throws Exception {
    System.out.println(path);
    HWPFDocument wdDoc;
    // close the input stream once the document has been loaded
    try (FileInputStream in = new FileInputStream(path)) {
        wdDoc = new HWPFDocument(new POIFSFileSystem(in));
    }
    StyleSheet styles = wdDoc.getStyleSheet();

    // list all style names and indexes in stylesheet
    for (int j = 0; j < styles.numStyles(); j++) {
        StyleDescription description = styles.getStyleDescription(j);
        if (description != null) {
            System.out.println(j + ": " + description.getName());
        } else {
            // getStyleDescription returned null
            System.out.println(j + ": " + null);
        }
    }

    // set range for entire document
    Range range = wdDoc.getRange();

    // loop through all paragraphs in range
    for (int i = 0; i < range.numParagraphs(); i++) {
        Paragraph p = range.getParagraph(i);

        // check that the style index is within the stylesheet
        if (styles.numStyles() > p.getStyleIndex()) {
            System.out.println(styles.numStyles() + " -> " + p.getStyleIndex());
            StyleDescription style = styles.getStyleDescription(p.getStyleIndex());
            // the listing loop above shows that a slot can be null
            String styleName = (style != null) ? style.getName() : null;
            // write style name and associated text
            System.out.println(styleName + " -> " + p.text());
        } else {
            System.out.println("\n" + styles.numStyles() + " ----> " + p.getStyleIndex());
        }
    }
}

Solution

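I would suggest taking a look at the source code of WordExtractor from Apache Tika (http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java), as it is a great example of getting text and styling from a Word document using Apache POI.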

Based on what you did and didn't say in your question, I suspect you're looking for something a little like this:

    // "document" is the HWPFDocument that was opened earlier
    Range r = document.getRange();
    for (int i = 0; i < r.numParagraphs(); i++) {
       Paragraph p = r.getParagraph(i);
       String text = p.text();
       if (!text.contains("What I'm Looking For")) {
          // Try the next paragraph
          continue;
       }

       if (document.getStyleSheet().numStyles() > p.getStyleIndex()) {
          StyleDescription style =
               document.getStyleSheet().getStyleDescription(p.getStyleIndex());
          // A slot in the stylesheet can be empty, so guard against null
          String styleName = (style != null) ? style.getName() : "(unnamed style)";
          System.out.println(styleName + " -> " + text);
       }
       else {
          // Text has an unknown or invalid style
       }
    }
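
The bounds check in the loop above, plus a null check, can be pulled into a small helper. The sketch below is only an illustration: it models the stylesheet as a plain `String[]` so that it runs without POI on the classpath, and the class name `StyleNameLookup` and method `resolve` are hypothetical, not part of Apache POI. In real code the array lookup would presumably be replaced by `getStyleDescription(index)` followed by `getName()`.

```java
import java.util.Optional;

/** Guarded lookup sketch; the stylesheet is modeled as a plain array (an assumption). */
final class StyleNameLookup {

    /**
     * Returns the style name at styleIndex, or an empty Optional when the
     * index is out of range or the slot holds no style (null).
     */
    static Optional<String> resolve(String[] styleNames, int styleIndex) {
        if (styleIndex < 0 || styleIndex >= styleNames.length) {
            return Optional.empty(); // unknown or invalid style index
        }
        return Optional.ofNullable(styleNames[styleIndex]);
    }

    public static void main(String[] args) {
        // Hypothetical stylesheet contents; slot 2 is deliberately empty.
        String[] names = {"Normal", "Heading 1", null};
        System.out.println(resolve(names, 1).orElse("(unknown style)"));
        System.out.println(resolve(names, 2).orElse("(unknown style)"));
        System.out.println(resolve(names, 9).orElse("(unknown style)"));
    }
}
```

Returning an `Optional` keeps the "unknown or invalid style" branch explicit at the call site instead of risking a `NullPointerException` on `getName()`.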

For anything more advanced, take a look at the WordExtractor source code and see what else you can do with this sort of thing!
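
Since the question asks for the text word by word, one possible next step is to split each paragraph's text into words and report the paragraph's style name for each of them. The following is a minimal sketch under stated assumptions: the class `ParagraphWords` and its method `words` are hypothetical helpers, the input string stands in for the value returned by `p.text()`, and it assumes that control characters in that text (such as a trailing `\r` paragraph mark) should be treated as separators. It covers paragraph-level styles only; styles applied to individual character runs are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits paragraph text into words; a stand-in for post-processing p.text(). */
final class ParagraphWords {

    /** Returns the whitespace-separated words of the given paragraph text. */
    static List<String> words(String paragraphText) {
        // Treat ASCII control characters (e.g. the trailing '\r') as spaces.
        String cleaned = paragraphText.replaceAll("\\p{Cntrl}", " ").trim();
        List<String> result = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return result; // paragraph contained no visible text
        }
        for (String word : cleaned.split("\\s+")) {
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical values; in real code these would come from the paragraph.
        String styleName = "Heading 1";
        String text = "Hello  styled world\r";
        for (String word : words(text)) {
            System.out.println(styleName + " -> " + word);
        }
    }
}
```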
