Extracting PDF annotations/comments


Question

We have a fairly complex print workflow in which the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, imported PDF documents with annotations and comments should be parsed, and the annotations should be imported into a CMS (together with the PDF).

Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean and reliable way?

Recommended answer

This code should do the job. One of the answers to the question "Parse annotations from a pdf" was very helpful in getting me to write the code below. It uses the poppler library to parse the annotations. The sample input file is annotations.pdf.

Code

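import os.path

import poppler  # Python 2 bindings from the python-poppler package

# poppler expects a file:// URI rather than a plain path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)  # None: no password
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]

for page_no, page in enumerate(pages):
    # Collect the text of every annotation on the page, skipping empty ones
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print("page: %s comments: %s " % (page_no + 1, items))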

Output

page: 1 comments: ['This is an annotation'] 
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text'] 
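
For the CMS import step mentioned in the question, the per-page lists printed above could be flattened into one record per comment. The following is only a sketch: the helper name to_records, the field names "page" and "text", and the choice of JSON are assumptions made for illustration, not part of the original answer or of any particular CMS API. It also strips surrounding whitespace and drops exact repeats on the same page, which may or may not be what a given workflow wants.

```python
import json


def to_records(comments_by_page):
    """Flatten {page_number: [comment, ...]} into a list of dicts.

    Hypothetical helper: the field names are illustrative only.
    """
    records = []
    for page_no in sorted(comments_by_page):
        seen = set()
        for text in comments_by_page[page_no]:
            text = text.strip()
            # Skip empty strings and exact repeats on the same page
            if not text or text in seen:
                continue
            seen.add(text)
            records.append({"page": page_no, "text": text})
    return records


# Sample data copied from the output shown above
sample = {
    1: ['This is an annotation'],
    2: [' Please note ', ' Please note ', 'This is a comment in the text'],
}

records = to_records(sample)
print(json.dumps(records, indent=2))
```

Dropping repeats is a design choice: the two ' Please note ' entries on page 2 may be distinct annotations with identical text, so a real import might keep both and tell them apart by position or annotation type instead.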

Installation

On Ubuntu, the installation is as follows.

apt-get install python-poppler
