Processing a docx file using Apache POI


Problem description

I'm trying to retrieve a docx file from a database and process it by checking its content. I think my code retrieves the file I want, but it seems that I haven't fully understood Apache POI. The stack trace shows an error saying that I'm calling the wrong part of POI. Any ideas?

Here's how I load the file:

public void loadFile(String FileName)
{
    InputStream is = null;
    try
    {
        //Connecting to MYSQL Database
        Class.forName(driver).newInstance();
        con = DriverManager.getConnection(url+dbName,userName,password);

        Statement stmt = (Statement) con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT FILE FROM doccompfiles WHERE FileName = '"+ FileName +"'");

        while(rs.next())
        {
            is = rs.getBinaryStream("FILE");
        }

        HWPFDocument doc = new HWPFDocument(is);
        WordExtractor we = new WordExtractor(doc);

        String[] paragraphs = we.getParagraphText();
        JOptionPane.showMessageDialog(null, "Number of Paragraphs" + paragraphs.length);
        con.close();
    }
    catch(Exception ex)
    {
        ex.printStackTrace();
    }
}

Stacktrace:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:131)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:106)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
at documentComparisor.Database.loadFile(Database.java:156)
at documentComparisor.Home$5.actionPerformed(Home.java:195)
at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(Unknown Source)
at java.awt.Component.processMouseEvent(Unknown Source)
at javax.swing.JComponent.processMouseEvent(Unknown Source)
at java.awt.Component.processEvent(Unknown Source)
at java.awt.Container.processEvent(Unknown Source)
at java.awt.Component.dispatchEventImpl(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Window.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
at java.awt.EventQueue.access$000(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue.dispatchEvent(Unknown Source)
at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.run(Unknown Source)

Solution

MS Office documents currently exist in two different formats: the old binary (OLE2) format used by versions of MS Office before 2007 (e.g. ".doc" or ".xls"), and the XML-based format used by newer versions (e.g. ".docx" or ".xlsx").
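To see which of the two formats a stored file really is, you can look at its first bytes: OLE2 files begin with the signature D0 CF 11 E0 A1 B1 1A E1, while the XML-based files are ZIP archives and begin with 50 4B 03 04. The following is only a minimal sketch using the JDK alone; the class and method names (OfficeFormatSniffer, classify, sniff) are made up for illustration and are not part of POI. Note that a ZIP signature only tells you the data is a ZIP archive, not that it is a valid .docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper (not a POI class): guesses the container format from the leading bytes.
final class OfficeFormatSniffer {
    enum Format { OLE2, OOXML_ZIP, UNKNOWN }

    // Signature of the old binary (OLE2) container used by .doc/.xls
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive, the container for .docx/.xlsx
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a buffer holding the first bytes of a file.
    static Format classify(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Format.OOXML_ZIP; // assumption: ZIP means OOXML here
        return Format.UNKNOWN;
    }

    // Peeks at the stream and rewinds it, so the caller can still parse it afterwards.
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 8 bytes were read
            n += r;
        }
        in.reset();
        return classify(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A stream returned by a JDBC driver may not support mark/reset; in that case wrap it in a java.io.BufferedInputStream and pass that same wrapper on to whatever parses the document.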

Apache POI has separate components for handling the two formats. Names of key classes for files in the old MS Office format generally start with "H" (e.g. HWPFDocument), while names of classes for files in the XML-based format start with "X" (e.g. XWPFDocument).

So in your example, to handle the new format you should use XWPFDocument instead of HWPFDocument:

XWPFDocument doc = new XWPFDocument(is);
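Keep in mind that WordExtractor works with HWPFDocument, so the paragraph count in the question has to change as well. One possible way, sketched below, is to read the paragraphs directly from the XWPFDocument. This assumes the poi-ooxml library is on the classpath; the class and method names (DocxParagraphs, countParagraphs) are made up for illustration, and getParagraphs() returns only the paragraphs of the document body, not those inside tables, headers or footers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

// Hypothetical helper: counts the body paragraphs of a .docx read from a stream.
final class DocxParagraphs {
    static int countParagraphs(InputStream in) throws IOException {
        // XWPFDocument is Closeable, so try-with-resources releases it when done
        try (XWPFDocument doc = new XWPFDocument(in)) {
            List<XWPFParagraph> paragraphs = doc.getParagraphs();
            return paragraphs.size();
        }
    }
}
```

In loadFile this would replace the HWPFDocument and WordExtractor lines, for example by passing the stream from rs.getBinaryStream("FILE") to countParagraphs. To get the text of each paragraph, call getText() on the XWPFParagraph objects.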
