如何从excel文件中提取外部引用列表 [英] How to extract a list of external references from a excel file

查看:92
本文介绍了如何从excel文件中提取外部引用列表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在为(不断增长的)非结构化 excel 文档集合设置自动处理系统.该集合包含旧式 .xls 文件和新的 .xlsx 文件.在我基于 Java 的解决方案中,我已经在使用 Apache POI 工具包来分析文档.

I'm working on setting up an automated processing system for a (ever growing) unstructured collection of excel documents. The collection consists of both old-school .xls files and new .xlsx files. In my Java-based solution I am already making use of the Apache POI toolkit to analyse the documents.

我尚未解决的一个挑战是如何识别文档之间的链接以绘制依赖关系.我还没有弄清楚如何方便地提取外部引用列表.对于 .xlsx 文件,我有一个解决方法可以解压缩文件,并打开包含引用的 xml 文件.这有效但对于大型文档集合效率低下,并且也没有为 .xls 文件提供解决方案.

One challenges that I have not been able to tackle yet, is how to identify links between documents so as to chart dependencies. I have not yet been able to figure out how to conveniently extract a list of external references. For .xlsx files I have a workaround in place that unzips the file, and opens the xml file holding the references. This works but is inefficient for large document collections, and also does not provide a solution for .xls files.

我更喜欢不依赖于 Microsoft Office 或相关库的解决方案,因为该解决方案需要在 Linux 环境中运行.

I prefer to have a solution that is not dependent on Microsoft Office or associated libraries as the solution needs to run on a Linux environment.

POI 是否能够以某种方式做到这一点?如果没有,我可以进一步调查的建议库/工具/区域是什么?

Is POI capable of doing this somehow? If not, what would be suggested libraries/tools/area's that I could further investigate?

推荐答案

最终我完成了 POI 源代码,并使用反射来获取引用的外部工作簿的列表.以下代码经过测试可在 POI 3.11 beta 版上运行.

Ultimately I worked my way through the POI source code and used reflection to get a list of referenced external workbooks. The following code was tested to work on POI version 3.11 beta.

希望在代码中使用此方法的人请注意:因为它处理非公共方法和类,所以它可能会发生变化并且将来可能会中断.

Note for people looking to use this method in there code: Because it deals with non-public methods and classes, it is subject to change and may break in the future.

private LinkedList<String> getWorkbookReferences(HSSFWorkbook wb) {
    LinkedList<String> references = new LinkedList<>();

    try {
        // 1. Get InternalWorkbook
        Field internalWorkbookField = HSSFWorkbook.class.getDeclaredField("workbook");
        internalWorkbookField.setAccessible(true);
        InternalWorkbook internalWorkbook = (InternalWorkbook) internalWorkbookField.get(wb);

        // 2. Get LinkTable (hidden class)
        Method getLinkTableMethod;
        getLinkTableMethod = InternalWorkbook.class.getDeclaredMethod("getOrCreateLinkTable", null);

        getLinkTableMethod.setAccessible(true);
        Object linkTable = getLinkTableMethod.invoke(internalWorkbook, null);

        // 3. Get external books method
        Method externalBooksMethod = linkTable.getClass().getDeclaredMethod("getExternalBookAndSheetName", int.class);
        externalBooksMethod.setAccessible(true);

        // 4. Loop over all possible workbooks
        int i = 0;
        String[] names;
        try {
            while( true) {
                names = (String[]) externalBooksMethod.invoke(linkTable, i++) ;                     if (names != null ) {
                    references.add(names[0]);
                }
            }
        }
        catch  ( java.lang.reflect.InvocationTargetException e) {
            if ( !(e.getCause() instanceof java.lang.IndexOutOfBoundsException) ) {
                throw e;
            }
        }
    } catch (NoSuchFieldException | NoSuchMethodException | SecurityException | InvocationTargetException | IllegalAccessException e) {
        e.printStackTrace();
    }

    return references;
}

这篇关于如何从excel文件中提取外部引用列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆