Extract hyperlinks from PDF in Python

Views: 609 | Published: 2020/5/25 4:05:09 | Tags: python pdf hyperlink pypdf pdfminer

Question

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.

For example, I have text that says "Check this link out", with a link attached to it. I am able to extract the words "Check this link out", but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at iTextSharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.

Recommended answer

I think you could do that with PyPDF, if what you want is to extract the links from the PDF. I am not sure where I got this from, but it lives in my code as part of something else. Hope this helps:

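import pyPdf

PDFFile = open('File Location', 'rb')  # path to the PDF; open in binary mode

PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'  # page entry that holds the annotation array
uri = '/URI'     # target URL inside a link action
ank = '/A'       # action dictionary of an annotation

for page in range(pages):

    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()

    if key in pageObject:
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            # Skip annotations that have no action or whose action is not a URI
            if ank in u and uri in u[ank]:
                print(u[ank][uri])

PDFFile.close()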

I hope this gives you the links in your PDF. P.S.: I haven't tried this extensively.
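
For readers on Python 3: as far as I know, the pyPdf package used above is no longer maintained, and pypdf is its successor. Below is a minimal sketch of the same idea (walk each page's /Annots array and read /A then /URI) assuming pypdf is installed. The function names `uris_from_annotations`, `extract_links` and the helper `_resolve` are made up for illustration and are not part of any library. I have not tried it on a wide range of PDFs either.

```python
def _resolve(obj):
    """Follow an indirect reference if the object offers get_object().

    pypdf objects have get_object(); plain dicts and lists pass through.
    """
    get_object = getattr(obj, "get_object", None)
    return get_object() if callable(get_object) else obj


def uris_from_annotations(annotations):
    """Return the /URI target of every annotation that has a URI action."""
    uris = []
    for ref in annotations or []:
        annot = _resolve(ref)
        # Ignore anything that is not a dictionary, or has no action (/A)
        if not isinstance(annot, dict) or "/A" not in annot:
            continue
        action = _resolve(annot["/A"])
        if isinstance(action, dict) and "/URI" in action:
            target = action["/URI"]
            if isinstance(target, bytes):
                target = target.decode("utf-8", "replace")
            uris.append(str(target))
    return uris


def extract_links(pdf_path):
    """Collect (page_number, uri) pairs from the PDF at pdf_path."""
    from pypdf import PdfReader  # third-party; assumed installed (pip install pypdf)

    links = []
    reader = PdfReader(pdf_path)
    for number, page in enumerate(reader.pages, start=1):
        if "/Annots" in page:
            for uri in uris_from_annotations(_resolve(page["/Annots"])):
                links.append((number, uri))
    return links
```

Calling `extract_links('File Location')` with a real path should return a list of page number and URL pairs. Only links stored as URI actions are found this way; internal "go to page" links have no /URI entry and are skipped.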
