Extract specific parts of PDF documents
Problem description
I have multiple (30) PDF files, each containing 48-96 pages. The layout of all pages is identical; only the contents (numbers, graphs) differ.
Background: These pages are PDF reports of fibre cable measurements, and I have to sort them by the attenuation of the cables. Due to confidentiality issues, I unfortunately cannot provide an example file.
To verify these reports, we are taking some control samples, which is why I need the reports sorted. The question now is: How can I export only very specific parts of all pages in all PDF files to some format I can sort?
As already mentioned, the values are located at very specific positions on the page. The content is also already "parsed", so it is available "as text" in the PDF file; it is not scanned, and no OCR is required.
Any help is appreciated. I currently have no idea how to solve this; it could be a tool that does something like this, or a programming approach.
As you indicate in your comments on the original question, you are prepared to program a solution. I would propose using Java and the iText PDF library. It lets you extract text from documents as long as the text actually is extractable (a PDF can contain glyphs while omitting the mappings from glyphs to characters, in which case the text cannot be recovered).
You can find sample code for PDF text extraction with iText in the ExtractPageContent* samples for chapter 15 of iText in Action, 2nd Edition. ExtractPageContentArea is of particular interest in your case.
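For orientation, here is a minimal sketch of region-restricted extraction in the spirit of that sample. It is not a copy of the book's code. It assumes iText 5.x (the com.itextpdf.text packages) is on the classpath; the class name ExtractRegion and the rectangle coordinates are placeholders, and the real coordinates have to be measured on one of the reports.

```java
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.IOException;

// Hypothetical class name; requires iText 5.x on the classpath.
public class ExtractRegion {
    public static void main(String[] args) throws IOException {
        PdfReader reader = new PdfReader(args[0]); // path to one report PDF
        try {
            // Placeholder rectangle (lower-left x/y, upper-right x/y) in PDF user space units.
            Rectangle area = new Rectangle(70, 80, 490, 580);
            RenderFilter filter = new RegionTextRenderFilter(area);
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                // A fresh strategy per page so text from earlier pages is not carried over.
                TextExtractionStrategy strategy =
                        new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                String text = PdfTextExtractor.getTextFromPage(reader, page, strategy);
                System.out.println(page + "\t" + text.trim());
            }
        } finally {
            reader.close();
        }
    }
}
```

Because all pages share one layout, a single rectangle per value should be enough for every page of every file. Newer iText versions use different package names, so the imports above would need adjusting there.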
Essentially, you only have to take that sample and generalize it to extract the text from multiple areas on the page.
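One possible shape for that generalization, as a sketch under assumptions: the same named rectangles are applied to every page, the results are sorted by attenuation, and a CSV is written. Everything here is hypothetical (RegionReport, RegionExtractor, the region names, the coordinates), and the extractor is left as an interface so the PDF library call can be plugged in. The demo in main uses canned text instead of a real PDF.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; all names are illustrative only.
final class RegionReport {

    /** Returns the text inside one rectangle {llx, lly, urx, ury} of one page. */
    interface RegionExtractor {
        String extract(int page, float[] rect);
    }

    /** One row per page: the page number plus the text of each named region. */
    static final class Row {
        final int page;
        final Map<String, String> fields = new LinkedHashMap<>();
        Row(int page) { this.page = page; }
    }

    /** Applies the same named rectangles to every page, since all pages share one layout. */
    static List<Row> collect(int pageCount, Map<String, float[]> regions, RegionExtractor extractor) {
        List<Row> rows = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            Row row = new Row(page);
            for (Map.Entry<String, float[]> region : regions.entrySet()) {
                row.fields.put(region.getKey(), extractor.extract(page, region.getValue()).trim());
            }
            rows.add(row);
        }
        return rows;
    }

    /** Parses the first number in a field such as "0.21 dB"; assumes a dot as decimal separator. */
    static double firstNumber(String text) {
        Matcher m = Pattern.compile("-?\\d+(\\.\\d+)?").matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no number in: " + text);
        }
        return Double.parseDouble(m.group());
    }

    /** Sorts rows in ascending order of the numeric value found in the given field. */
    static void sortBy(List<Row> rows, String field) {
        rows.sort(Comparator.comparingDouble((Row r) -> firstNumber(r.fields.get(field))));
    }

    /** Writes a header plus one line per row; assumes the field texts contain no commas. */
    static String toCsv(List<Row> rows) {
        if (rows.isEmpty()) {
            return "";
        }
        StringBuilder csv = new StringBuilder();
        csv.append("page,").append(String.join(",", rows.get(0).fields.keySet())).append('\n');
        for (Row row : rows) {
            csv.append(row.page).append(',').append(String.join(",", row.fields.values())).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        Map<String, float[]> regions = new LinkedHashMap<>();
        regions.put("cable", new float[] {70, 700, 300, 720});       // placeholder coordinates
        regions.put("attenuation", new float[] {70, 400, 200, 420}); // placeholder coordinates
        String[] canned = {"0.35 dB", "0.21 dB", "0.28 dB"};         // stand-in for real PDF text
        List<Row> rows = collect(3, regions,
                (page, rect) -> rect[1] == 700 ? "Cable " + page : canned[page - 1]);
        sortBy(rows, "attenuation");
        System.out.print(toCsv(rows));
    }
}
```

With 30 input files, a file name column would probably be needed as well so that a sorted row can be traced back to its report. If the reports print numbers with a decimal comma, firstNumber would have to be adapted.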