如何从excel文件中提取外部引用列表 [英] How to extract a list of external references from a excel file
问题描述
我正在为(不断增长的)非结构化 excel 文档集合设置自动处理系统.该集合包含旧式 .xls
文件和新的 .xlsx
文件.在我基于 Java 的解决方案中,我已经在使用 Apache POI 工具包来分析文档.
I'm working on setting up an automated processing system for a (ever growing) unstructured collection of excel documents. The collection consists of both old-school .xls
files and new .xlsx
files. In my Java-based solution I am already making use of the Apache POI toolkit to analyse the documents.
我尚未解决的一个挑战是如何识别文档之间的链接以绘制依赖关系.我还没有弄清楚如何方便地提取外部引用列表.对于 .xlsx
文件,我有一个解决方法可以解压缩文件,并打开包含引用的 xml 文件.这有效但对于大型文档集合效率低下,并且也没有为 .xls
文件提供解决方案.
One challenges that I have not been able to tackle yet, is how to identify links between documents so as to chart dependencies. I have not yet been able to figure out how to conveniently extract a list of external references. For .xlsx
files I have a workaround in place that unzips the file, and opens the xml file holding the references. This works but is inefficient for large document collections, and also does not provide a solution for .xls
files.
我更喜欢不依赖于 Microsoft Office 或相关库的解决方案,因为该解决方案需要在 Linux 环境中运行.
I prefer to have a solution that is not dependent on Microsoft Office or associated libraries as the solution needs to run on a Linux environment.
POI 是否能够以某种方式做到这一点?如果没有,我可以进一步调查的建议库/工具/区域是什么?
Is POI capable of doing this somehow? If not, what would be suggested libraries/tools/area's that I could further investigate?
推荐答案
最终我完成了 POI 源代码,并使用反射来获取引用的外部工作簿的列表.以下代码经过测试可在 POI 3.11 beta 版上运行.
Ultimately I worked my way through the POI source code and used reflection to get a list of referenced external workbooks. The following code was tested to work on POI version 3.11 beta.
希望在代码中使用此方法的人请注意:因为它处理非公共方法和类,所以它可能会发生变化并且将来可能会中断.
Note for people looking to use this method in there code: Because it deals with non-public methods and classes, it is subject to change and may break in the future.
private LinkedList<String> getWorkbookReferences(HSSFWorkbook wb) {
LinkedList<String> references = new LinkedList<>();
try {
// 1. Get InternalWorkbook
Field internalWorkbookField = HSSFWorkbook.class.getDeclaredField("workbook");
internalWorkbookField.setAccessible(true);
InternalWorkbook internalWorkbook = (InternalWorkbook) internalWorkbookField.get(wb);
// 2. Get LinkTable (hidden class)
Method getLinkTableMethod;
getLinkTableMethod = InternalWorkbook.class.getDeclaredMethod("getOrCreateLinkTable", null);
getLinkTableMethod.setAccessible(true);
Object linkTable = getLinkTableMethod.invoke(internalWorkbook, null);
// 3. Get external books method
Method externalBooksMethod = linkTable.getClass().getDeclaredMethod("getExternalBookAndSheetName", int.class);
externalBooksMethod.setAccessible(true);
// 4. Loop over all possible workbooks
int i = 0;
String[] names;
try {
while( true) {
names = (String[]) externalBooksMethod.invoke(linkTable, i++) ; if (names != null ) {
references.add(names[0]);
}
}
}
catch ( java.lang.reflect.InvocationTargetException e) {
if ( !(e.getCause() instanceof java.lang.IndexOutOfBoundsException) ) {
throw e;
}
}
} catch (NoSuchFieldException | NoSuchMethodException | SecurityException | InvocationTargetException | IllegalAccessException e) {
e.printStackTrace();
}
return references;
}
这篇关于如何从excel文件中提取外部引用列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!