Extract hyperlinks from PDF in Python

Question

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract the text. However, it does not extract the hyperlinks.

For example, I have text that says "Check this link out", with a link attached to it. I am able to extract the words "Check this link out", but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at iTextSharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.

Recommended answer

I think you can do this with pyPdf if you want to extract the links from a PDF. I am not sure where I got this from, but it resides in my code as part of something else. Hope this helps:

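import pyPdf  # legacy Python 2 library that provides PdfFileReader

PDFFile = open('File Location', 'rb')  # replace with the path to your PDF

PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()

    # Annotations, if the page has any, are listed under its /Annots entry
    if key in pageObject:
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            # Skip annotations with no action, or whose action is not a URI
            if ank in u and uri in u[ank]:
                print(u[ank][uri])

PDFFile.close()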

This should, I hope, give you the links in your PDF. P.S. I haven't tried this extensively.
