如何从Excel文件中提取外部引用列表 [英] How to extract a list of external references from a excel file
问题描述
我正在为(不断增长的)非结构化excel文档集合建立一个自动处理系统.该集合由旧式.xls
文件和新的.xlsx
文件组成.在基于Java的解决方案中,我已经在使用Apache POI工具包来分析文档.
I'm working on setting up an automated processing system for a (ever growing) unstructured collection of excel documents. The collection consists of both old-school .xls
files and new .xlsx
files. In my Java-based solution I am already making use of the Apache POI toolkit to analyse the documents.
我尚无法解决的一个挑战是如何识别文档之间的链接以绘制依赖关系图.我还无法弄清楚如何方便地提取外部引用列表.对于.xlsx
文件,我有一个解决方法,将文件解压缩,然后打开保存引用的xml文件.这种方法有效,但是对于大型文档集合而言效率低下,并且也没有为.xls
文件提供解决方案.
One challenges that I have not been able to tackle yet, is how to identify links between documents so as to chart dependencies. I have not yet been able to figure out how to conveniently extract a list of external references. For .xlsx
files I have a workaround in place that unzips the file, and opens the xml file holding the references. This works but is inefficient for large document collections, and also does not provide a solution for .xls
files.
我希望有一个不依赖于Microsoft Office或相关库的解决方案,因为该解决方案需要在Linux环境中运行.
I prefer to have a solution that is not dependent on Microsoft Office or associated libraries as the solution needs to run on a Linux environment.
POI是否可以执行此操作?如果没有,那么我可以进一步研究的建议的库/工具/领域是什么?
Is POI capable of doing this somehow? If not, what would be suggested libraries/tools/area's that I could further investigate?
推荐答案
最终,我通过POI源代码完成了自己的工作,并使用了反射来获取所引用的外部工作簿的列表.以下代码经过测试可在POI版本3.11 beta上运行.
Ultimately I worked my way through the POI source code and used reflection to get a list of referenced external workbooks. The following code was tested to work on POI version 3.11 beta.
为那些希望在此处的代码中使用此方法的人提供注意:由于它处理非公共方法和类,因此可能会发生更改,并且将来可能会中断.
Note for people looking to use this method in there code: Because it deals with non-public methods and classes, it is subject to change and may break in the future.
private LinkedList<String> getWorkbookReferences(HSSFWorkbook wb) {
LinkedList<String> references = new LinkedList<>();
try {
// 1. Get InternalWorkbook
Field internalWorkbookField = HSSFWorkbook.class.getDeclaredField("workbook");
internalWorkbookField.setAccessible(true);
InternalWorkbook internalWorkbook = (InternalWorkbook) internalWorkbookField.get(wb);
// 2. Get LinkTable (hidden class)
Method getLinkTableMethod;
getLinkTableMethod = InternalWorkbook.class.getDeclaredMethod("getOrCreateLinkTable", null);
getLinkTableMethod.setAccessible(true);
Object linkTable = getLinkTableMethod.invoke(internalWorkbook, null);
// 3. Get external books method
Method externalBooksMethod = linkTable.getClass().getDeclaredMethod("getExternalBookAndSheetName", int.class);
externalBooksMethod.setAccessible(true);
// 4. Loop over all possible workbooks
int i = 0;
String[] names;
try {
while( true) {
names = (String[]) externalBooksMethod.invoke(linkTable, i++) ; if (names != null ) {
references.add(names[0]);
}
}
}
catch ( java.lang.reflect.InvocationTargetException e) {
if ( !(e.getCause() instanceof java.lang.IndexOutOfBoundsException) ) {
throw e;
}
}
} catch (NoSuchFieldException | NoSuchMethodException | SecurityException | InvocationTargetException | IllegalAccessException e) {
e.printStackTrace();
}
return references;
}
这篇关于如何从Excel文件中提取外部引用列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!