Extract specific parts of PDF documents

Question

I have multiple (30) PDF files, each containing 48-96 pages. The layout of all pages is identical; only the contents (numbers, graphs) differ.

Background: These pages are PDF reports of fibre cable measurements, and I have to sort them by the attenuation of the cables. For confidentiality reasons, I unfortunately cannot provide an example file.

To verify these reports, we are taking some control samples, which is why I need the reports sorted. The question now is: how can I export only very specific parts of all pages in all PDF files to some format I can sort?

As already mentioned, the values are located at very specific positions on the page. The content is also already "parsed", so it is available "as text" in the PDF file; it is not scanned, and no OCR is required.

Any help is appreciated. I currently have no idea how to solve this. It could be a tool that does something like this, or a programming approach.

Solution

As you indicate in your comments on the original question, you are prepared to program a solution. I would propose using Java and the iText PDF library. It lets you extract text from documents as long as the text actually is extractable (it is possible to put glyphs into a PDF but drop the mappings from glyphs to characters, in which case extraction fails).

You can find sample code for PDF text extraction with iText in the ExtractPageContent* samples for chapter 15 of iText in Action, 2nd Edition. ExtractPageContentArea is especially of interest in your case.

Essentially, you only have to take that sample and generalize it to extract the text from multiple areas on the page.
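To illustrate what such a generalization might look like, here is a minimal sketch of the surrounding plumbing: loop over all pages, pull the text of several named regions, sort by the numeric value of one region, and write CSV. Everything named here (`RegionReport`, `RegionExtractor`, `Row`, the region names and coordinates) is hypothetical and not part of iText. The actual PDF access is hidden behind the `RegionExtractor` interface so the sketch runs without any library; the comment in `main` shows how an iText 5 based extractor could be plugged in, modelled on the ExtractPageContentArea sample and untested here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: collects text from named page regions.
class RegionReport {

    /** Returns the text inside rect {llx, lly, urx, ury} (PDF points) on a 1-based page. */
    interface RegionExtractor {
        String textIn(int page, float[] rect);
    }

    /** One output row: source file, page number, and the text of each named region. */
    static final class Row {
        final String file;
        final int page;
        final Map<String, String> fields = new LinkedHashMap<>();
        Row(String file, int page) { this.file = file; this.page = page; }
    }

    /** Extracts every region from every page of one file. */
    static List<Row> collect(String file, int pageCount,
                             Map<String, float[]> regions, RegionExtractor extractor) {
        List<Row> rows = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            Row row = new Row(file, page);
            for (Map.Entry<String, float[]> e : regions.entrySet()) {
                String text = extractor.textIn(page, e.getValue());
                row.fields.put(e.getKey(), text == null ? "" : text.trim());
            }
            rows.add(row);
        }
        return rows;
    }

    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:[.,]\\d+)?");

    /** Parses the first number in s, accepting '.' or ',' as decimal separator; NaN if none. */
    static double parseNumber(String s) {
        Matcher m = NUMBER.matcher(s == null ? "" : s);
        return m.find() ? Double.parseDouble(m.group().replace(',', '.')) : Double.NaN;
    }

    /** Sorts ascending by the numeric value of one field; rows without a number go last. */
    static void sortBy(List<Row> rows, String field) {
        rows.sort(Comparator.comparingDouble((Row r) -> parseNumber(r.fields.get(field))));
    }

    /** Renders the rows as CSV with quoted text fields. */
    static String toCsv(List<Row> rows, List<String> columns) {
        StringBuilder sb = new StringBuilder("file,page," + String.join(",", columns) + "\n");
        for (Row r : rows) {
            sb.append(quote(r.file)).append(',').append(r.page);
            for (String c : columns) sb.append(',').append(quote(r.fields.getOrDefault(c, "")));
            sb.append('\n');
        }
        return sb.toString();
    }

    private static String quote(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Assumed coordinates; measure the real ones on one of your report pages.
        Map<String, float[]> regions = new LinkedHashMap<>();
        regions.put("cable", new float[] {50, 700, 200, 720});
        regions.put("attenuation", new float[] {300, 700, 400, 720});

        // Stand-in extractor with made-up values so the sketch runs without a PDF.
        // With iText 5 on the classpath it could instead be something like:
        //   PdfReader reader = new PdfReader(path);
        //   RegionExtractor ex = (page, r) -> {
        //       RenderFilter f = new RegionTextRenderFilter(new Rectangle(r[0], r[1], r[2], r[3]));
        //       TextExtractionStrategy s =
        //           new FilteredTextRenderListener(new LocationTextExtractionStrategy(), f);
        //       try { return PdfTextExtractor.getTextFromPage(reader, page, s); }
        //       catch (IOException e) { throw new UncheckedIOException(e); }
        //   };
        String[] fake = {"0.35 dB", "0,21 dB", "0.28 dB"};
        RegionExtractor ex = (page, r) -> r[0] == 50f ? "Cable " + page : fake[page - 1];

        List<Row> rows = collect("report01.pdf", 3, regions, ex);
        sortBy(rows, "attenuation");
        System.out.print(toCsv(rows, List.of("cable", "attenuation")));
    }
}
```

Calling `collect` once per file and appending the results to a single list before sorting would give one CSV across all 30 reports, which a spreadsheet can then re-sort or filter as needed.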
