Extracting links to pages in another PDF from a PDF using Python or another method


Question

I have 5 PDF files, each of which has links to different pages in another PDF file. Each file is a table of contents for a large PDF (~1000 pages each), which makes manual extraction possible but very painful. So far I have tried opening the files in Acrobat Pro, where I can right-click each link and see which page it points to, but I need to extract all the links in some manner. I am not opposed to doing a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried exporting the PDF from Acrobat Pro as HTML and as Word, but neither method preserved the links.

I'm at my wits' end, and any help would be great. I'm comfortable working with Python or a range of other languages.

Recommended answer

Using pyPdf to find the URIs:

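import pyPdf

# Keys used by link annotations in the PDF object tree
key = '/Annots'  # per-page list of annotations
uri = '/URI'     # target of a URI action
ank = '/A'       # action dictionary of an annotation

with open('TMR-Issue6.pdf', 'rb') as f:
    pdf = pyPdf.PdfFileReader(f)
    pgs = pdf.getNumPages()

    for pg in range(pgs):
        p = pdf.getPage(pg)
        o = p.getObject()

        # Pages without annotations have no /Annots entry
        if key in o:
            ann = o[key]
            for a in ann:
                u = a.getObject()
                # Not every annotation carries an action, so check for /A first
                if ank in u and uri in u[ank]:
                    print(u[ank][uri])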

which gives:

http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html

etc...

I couldn't find a file that had links to another PDF, but I suspect that the URI field should contain URIs of the form file:///myfiles.
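One caveat on that suspicion: the PDF format also defines a remote go-to action (`/S /GoToR`), which stores the target file under `/F` and the destination under `/D`. Links into another PDF may use it instead of `/URI`. Whether the asker's files do is an assumption, not something confirmed here. The sketch below shows how one annotation dictionary could be classified either way. The helper name `describe_link` and the returned tuple layout are hypothetical. It relies only on `in` and `[]` lookups, so it should accept both plain dicts and the dictionary objects returned by `a.getObject()` in the loop above.

```python
def describe_link(annot):
    """Classify one annotation dictionary.

    Returns ('uri', target, None), ('remote', filename, destination),
    or None when the annotation has no action we recognise.
    """
    if '/A' not in annot:
        return None  # annotation without an action dictionary
    action = annot['/A']
    kind = action['/S'] if '/S' in action else None

    # Ordinary web-style link: the target is stored under /URI
    if kind == '/URI' and '/URI' in action:
        return ('uri', action['/URI'], None)

    # Remote go-to: /F names the other file, /D the destination inside it
    if kind == '/GoToR' and '/F' in action:
        filespec = action['/F']
        # A file specification may be a plain string or a dictionary
        # that holds the file name under its own /F key
        if hasattr(filespec, 'keys'):
            filespec = filespec['/F'] if '/F' in filespec else None
        dest = action['/D'] if '/D' in action else None
        # An explicit destination is an array whose first element is
        # the page index in the other file; a named destination is a string
        if isinstance(dest, (list, tuple)) and dest:
            dest = dest[0]
        return ('remote', filespec, dest)

    return None


if __name__ == '__main__':
    # Hand-built sample annotation, not taken from a real file
    sample = {'/A': {'/S': '/GoToR', '/F': 'big.pdf', '/D': [41, '/Fit']}}
    print(describe_link(sample))
```

In the loop above, calling `describe_link(u)` in place of the `/URI` check would report both kinds of link. For explicit remote destinations the page index counts from zero, so a printed page number would be the index plus one.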
