How to get the line number and page number of a particular word in a doc or docx file using Apache POI?

Question

I am trying to create a Java application that searches for a particular word in a selected doc or docx file and generates a report on it. The report should contain the page number and the line number of each occurrence of the word. So far I am able to read doc and docx files paragraph by paragraph, but I have not found any way to search for a particular word and get the line and page number where it appears. I have searched a lot without success. I hope someone knows how to do this.

Here is my code:

// fc is the JFileChooser used to pick the file.
File file = fc.getSelectedFile();
// Test the extension; contains("docx") would also match a "docx" folder name in the path.
if (file.getName().toLowerCase().endsWith(".docx")) {
    // .docx (OOXML) files are read with XWPF.
    try (FileInputStream fis = new FileInputStream(file)) {
        XWPFDocument document = new XWPFDocument(fis);
        List<XWPFParagraph> paragraphs = document.getParagraphs();
        System.out.println("Total no of paragraph " + paragraphs.size());
        for (XWPFParagraph para : paragraphs) {
            System.out.println(para.getText());
        }
    }
} else {
    // Legacy binary .doc files are read with HWPF.
    try (FileInputStream fis = new FileInputStream(file)) {
        HWPFDocument document = new HWPFDocument(fis);
        WordExtractor extractor = new WordExtractor(document);
        String[] fileData = extractor.getParagraphText();
        for (int i = 0; i < fileData.length; i++) {
            if (fileData[i] != null)
                System.out.println(fileData[i]);
        }
        extractor.close();
    }
}

I am using Swing and Apache POI 3.10.1.

Recommended answer

I am afraid there is no easy way to do this. Line and page numbers are not stored in the file; they are calculated on the fly from the text layout for the specified page size. The page width determines where the text wraps.
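To illustrate why a line number cannot simply be read from the file, here is a minimal sketch that wraps the same text at two different widths and reports the line on which a word ends up. It assumes a deliberately simplistic model in which every character has the same width and lines break only at whitespace; Word's real layout also depends on fonts, margins, hyphenation and more. The names `WrapDemo`, `wrap` and `lineOf` are hypothetical and not part of Apache POI.

```java
import java.util.ArrayList;
import java.util.List;

class WrapDemo {
    // Greedy word wrap: packs words into lines of at most `width` characters.
    // Assumption: fixed-width characters, which is not how Word measures text.
    static List<String> wrap(String text, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Start a new line if this word (plus a space) would not fit.
            if (line.length() > 0 && line.length() + 1 + word.length() > width) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    // 1-based number of the first wrapped line containing `word`, or -1 if absent.
    static int lineOf(String text, String word, int width) {
        List<String> lines = wrap(text, width);
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(word)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(lineOf(text, "lazy", 20)); // wide "page"
        System.out.println(lineOf(text, "lazy", 12)); // narrow "page"
    }
}
```

With the wider setting the word falls on line 2, with the narrower one on line 4, even though the text itself is unchanged. The same effect applies to page numbers, which is why a layout step is needed before either can be reported.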

You can try to implement the feature yourself by loading the document into a JEditorPane with an appropriate EditorKit. See, for example, the DocxEditorKit attempt at http://java-sl.com/docx_editor_kit.html. It provides only basic functionality, but you can use its source code and ideas as a starting point for your own EditorKit.

The kit should support pagination so that pages can be rendered (see the articles about pagination at http://java-sl.com/articles.html).

Once pagination is done, you can find the position of the word (its caret offset) and get the row and column from it (see http://java-sl.com/tip_row_column.html).
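The "find the position of the word" step can be sketched as a plain string search over the document text. This is a minimal illustration, assuming the text has already been obtained as a single String (for a Swing text component, for example, via `getDocument().getText(0, length)`) and that the search term consists of ordinary word characters. The class and method names `WordOffsets` and `find` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordOffsets {
    // Returns the 0-based character offsets of every whole-word,
    // case-insensitive occurrence of `word` in `text`.
    // Assumption: `word` starts and ends with a letter or digit, so \b applies.
    static List<Integer> find(String text, String word) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<Integer> offsets = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // "wordy" is skipped because only whole words are matched.
        System.out.println(find("Report: the word, The WORD and wordy.", "word"));
    }
}
```

Each offset returned here is what would then be handed to a row/column helper of the kind described in the linked tip. The offsets only become line and page numbers after the editor kit has laid out and paginated the text.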
