How do I extract attachments from a PDF file?
Question
I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?
Recommended answer
iTextSharp is also quite capable of extracting attachments... ugh... though you might have to use the low-level objects to do so.
There are two ways to embed files in a PDF:
- In a file attachment annotation.
- At the document level, in EmbeddedFiles.
Once you have a file specification dictionary from either source, its "EF" (embedded file) entry is a dictionary, and that dictionary's "F" entry holds the stream containing the file itself.
So to list all the files at the document level, one would write code (in Java) like this:
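// Needs java.util.HashMap, java.util.Map and com.itextpdf.text.pdf.* (iText 5 package name assumed)
Map<String, byte[]> files = new HashMap<String, byte[]>();
PdfReader reader = new PdfReader(pdfPath); // pdfPath: path to the PDF file
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // null if the document has no name trees
// EmbeddedFiles is a name tree; this handles only a flat tree (a /Names array, no /Kids)
PdfDictionary embeddedFilesTree = names == null ? null : names.getAsDict(PdfName.EMBEDDEDFILES);
PdfArray embeddedFiles = embeddedFilesTree == null ? null : embeddedFilesTree.getAsArray(PdfName.NAMES);
int len = embeddedFiles == null ? 0 : embeddedFiles.size();
// The array alternates between keys and file specification dictionaries
for (int i = 0; i < len; i += 2) {
    PdfString name = embeddedFiles.getAsString(i); // name tree keys are strings, not names
    PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1);
    PdfDictionary ef = fileSpec == null ? null : fileSpec.getAsDict(PdfName.EF);
    PRStream stream = ef == null ? null : (PRStream) ef.getAsStream(PdfName.F);
    if (name != null && stream != null) {
        // getStreamBytes applies the stream's filters, so these are the decoded file bytes
        files.put(name.toUnicodeString(), PdfReader.getStreamBytes(stream));
    }
}
reader.close();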
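The question also asks how to read the extracted XML. Once the attachment bytes are in memory, the JDK's built-in DOM parser can handle them without any PDF library. What follows is only a sketch under some assumptions: the class AttachmentXml, its method names, and the sample file name invoice.xml are hypothetical and not part of iText, and the decision to reject DOCTYPE declarations is a cautious default for files pulled from untrusted PDFs rather than something the original answer prescribes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not part of iText): parses extracted attachment bytes as XML.
final class AttachmentXml {

    // Parses the raw bytes of one attachment into a DOM document.
    // Any parser failure is rethrown as an unchecked IllegalArgumentException.
    static Document parse(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Assumption: attachments may be untrusted, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xmlBytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("Attachment is not well-formed XML", e);
        }
    }

    // Convenience accessor: the name of the document's root element.
    static String rootElementName(byte[] xmlBytes) {
        return parse(xmlBytes).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        // Stand-in for the map of attachment names to bytes built during extraction.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("invoice.xml",
                "<invoice><total>42</total></invoice>".getBytes(StandardCharsets.UTF_8));

        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            System.out.println(entry.getKey() + " -> <" + rootElementName(entry.getValue()) + ">");
        }
    }
}
```

In real use, the map would hold the bytes extracted from the PDF instead of the hard-coded sample, and entries that are not XML would surface as an IllegalArgumentException that the caller can catch and skip.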