Extract Embedded XML from PDF with iTextSharp (C#)

Question

I need to extract XML data embedded in Bankruptcy court files with C#. In PDF Reader the file looks like a typical court doc. In Notepad the XML is buried in the text. I've tried extracting the text with this and another code snippet using SimpleTextExtractionStrategy. The first results in a file with no identifiable text from the PDF and the second outputs symbols. I also tried accessing it as an AcroField and Xfaform. It doesn't seem to be either of those based on the Watch window.

Stepping through the code in Visual Studio, the XML shows up under PdfReader >> Catalog >> Keys >> Raw >> Non-Public Members >> dictionary in the Watch window. I have no idea how to get to it, though. Since it's listed with other PdfNames in Watch, I thought I might be able to access it via PdfReader.Catalog.GetAsDict, but it doesn't display as a PdfName. The provider of these files has a Java app that seems to just read the text. I'm not sure whether I need to use a different extraction strategy or directly access the catalog item containing the XML. I've never worked with PDF files or iTextSharp programmatically, so I'm struggling. Any code suggestions?

Recommended answer

It would help if you could share a PDF with an embedded XML. When I first read your question, I assumed that the XML would have been added as a document-level attachment (stored in EmbeddedFiles) or as an attachment annotation (stored in an Annot added to a page dictionary).

Reading what is written on uscourts.gov, it looks as if the XML is actually an XMP stream. That would mean that you can find it in the Metadata entry of the Catalog (or maybe in a page dictionary).
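
If the XML really were an XMP stream in the catalog's Metadata entry, reading it could look roughly like the sketch below. It assumes iTextSharp 5.x, takes the PDF path from the command line (a hypothetical choice for illustration), and assumes the XMP packet is UTF-8 encoded, which is common but not guaranteed.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;

class XmpDump
{
    static void Main(string[] args)
    {
        // args[0]: path to the PDF to inspect (hypothetical input for this sketch).
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            // Metadata yields the raw bytes of the catalog's /Metadata stream,
            // or null when the catalog has no such stream.
            byte[] xmp = reader.Metadata;
            if (xmp == null)
            {
                Console.WriteLine("No /Metadata stream in the catalog.");
            }
            else
            {
                // Assumption: the XMP packet is UTF-8 encoded.
                Console.WriteLine(Encoding.UTF8.GetString(xmp));
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A null result only rules out catalog-level XMP; the XML could still live elsewhere in the file.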

If you cannot share the file, you will have to help yourself. You can do this by downloading iText RUPS. It is a free tool to look inside a PDF.

Browse the tree structure and look for Metadata, look for EmbeddedFiles, look for Annots. If you don't tell us how the XML is embedded, nobody will be able to help you.

See my answer to the following question for an example: How to delete attachment of PDF using itext (look at how I use RUPS to look at the Catalog > Names > EmbeddedFiles).
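
The same Catalog > Names > EmbeddedFiles path can also be walked in code. The following is a minimal sketch, assuming iTextSharp 5.x and a flat name tree (a single Names array directly under EmbeddedFiles); larger files may nest the entries under Kids, which this sketch does not follow. The class name and the command-line argument are illustrative only.

```csharp
using System;
using iTextSharp.text.pdf;

class AttachmentList
{
    static void Main(string[] args)
    {
        // args[0]: path to the PDF to inspect (hypothetical input for this sketch).
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            PdfDictionary names = reader.Catalog.GetAsDict(PdfName.NAMES);
            PdfDictionary embedded = names == null ? null : names.GetAsDict(PdfName.EMBEDDEDFILES);
            // Assumption: flat name tree, i.e. [name1, filespec1, name2, filespec2, ...].
            PdfArray entries = embedded == null ? null : embedded.GetAsArray(PdfName.NAMES);
            if (entries == null)
            {
                Console.WriteLine("No flat EmbeddedFiles name tree found.");
                return;
            }
            for (int i = 0; i + 1 < entries.Size; i += 2)
            {
                PdfString label = entries.GetAsString(i);
                PdfDictionary fileSpec = entries.GetAsDict(i + 1);
                PdfDictionary ef = fileSpec == null ? null : fileSpec.GetAsDict(PdfName.EF);
                // /EF /F points at the stream that holds the attached file's bytes.
                PRStream stream = ef == null ? null : ef.GetAsStream(PdfName.F) as PRStream;
                if (stream == null) continue;
                byte[] bytes = PdfReader.GetStreamBytes(stream);
                string shown = label == null ? "(unnamed)" : label.ToUnicodeString();
                Console.WriteLine(shown + ": " + bytes.Length + " bytes");
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If this prints nothing for a file that visibly carries XML, the data is probably stored some other way, which is why inspecting the file in RUPS first is the more reliable route.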

Extra notes: the code you've tried so far is about extracting text from a page, NOT about extracting an XML file that is embedded inside a PDF.

Update:

Now that you've shared a file, I've used RUPS to find the XML file. Take a look at the following screen shot:

Do you see what happened here? Somebody added a custom entry named /USCTbankruptcynotice with a String as value straight to the catalog. That is so wrong: it is such a bad idea to store a file inside a string. Why didn't that developer store that file as a stream? I feel so sad for the person who employs such a developer.

This being said, this is how you can extract the XML:

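PdfDictionary catalog = reader.Catalog;
// Custom catalog key under which the XML was stored as a string.
PdfName name = new PdfName("USCTbankruptcynotice");
// GetAsString returns null if the entry is missing or is not a string.
PdfString USCTbankruptcynotice = catalog.GetAsString(name);
string xml = USCTbankruptcynotice.ToString();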

This is written from memory. Please update my answer if you need to apply small corrections.
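
For reference, a complete program built around the snippet above might look like the following sketch. It assumes iTextSharp 5.x; the input and output paths come from the command line, which is a hypothetical choice for illustration. It uses ToUnicodeString() rather than ToString() as a guess that this decodes the string more reliably when it is stored as Unicode text; whether that matters depends on how the file was produced.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;

class CatalogStringExtract
{
    static void Main(string[] args)
    {
        // args[0]: input PDF, args[1]: output XML file (both hypothetical for this sketch).
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            PdfDictionary catalog = reader.Catalog;
            PdfName name = new PdfName("USCTbankruptcynotice");
            PdfString value = catalog.GetAsString(name);
            if (value == null)
            {
                // The entry is absent or is not stored as a string in this file.
                Console.WriteLine("No /USCTbankruptcynotice string in the catalog.");
                return;
            }
            // Assumption: ToUnicodeString() gives the intended text for this file.
            string xml = value.ToUnicodeString();
            File.WriteAllText(args[1], xml);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The null check matters because not every court file is guaranteed to carry this custom entry.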
