Extract hyperlinks from PDF in Python
Problem description
I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.
For example, I have text that says "Check this link out", with a link attached to it. I am able to extract the words "Check this link out", but what I really need is the hyperlink itself, not the words.
How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.
I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.
Recommended answer
I think you could do that with PyPDF if you want to extract the links from the PDF. I am not sure where I got this snippet from, but it resides in my code as part of something else. Hope this helps:
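import pyPdf  # legacy package; PyPDF2 before 3.0 exposes the same names

PDFFile = open('File Location', 'rb')
PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    # Pages without annotations have no /Annots entry.
    if key in pageObject:
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            # Only annotations with a URI action carry a target URL.
            if ank in u and uri in u[ank]:
                print(u[ank][uri])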
I hope this gives you the links in your PDF. P.S.: I haven't tried this extensively.
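The snippet above relies on the old pyPdf method names. As a rough sketch of the same /Annots, /A, /URI walk for the newer `pypdf` package (assuming it is installed and that `PdfReader`, `reader.pages` and `get_object()` behave as in pypdf 3.x or later), something like the following might work. The names `iter_link_uris`, `extract_links` and `_resolve` are made up for this illustration and are not part of any library.

```python
def _resolve(obj):
    """Follow an indirect reference if the object offers get_object();
    plain Python values (dict, list, None) pass through unchanged."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_link_uris(pages):
    """Yield (page_number, uri) for each annotation that has a URI action.

    `pages` is any iterable of dict-like page objects, for example
    `PdfReader(path).pages` from pypdf. Page numbers start at 1.
    """
    for number, page in enumerate(pages, start=1):
        page = _resolve(page)
        # /Annots is optional; a page without annotations simply lacks it.
        annots = _resolve(page.get("/Annots")) or []
        for annot in annots:
            annot = _resolve(annot)
            # /A holds the action; only URI actions have a /URI entry.
            action = _resolve(annot.get("/A"))
            if isinstance(action, dict) and "/URI" in action:
                yield number, str(action["/URI"])


def extract_links(path):
    """Hypothetical convenience wrapper: open `path` with pypdf and
    return a list of (page_number, uri) tuples."""
    from pypdf import PdfReader  # third-party, imported only when needed

    reader = PdfReader(path)
    return list(iter_link_uris(reader.pages))
```

Calling `extract_links('File Location')` should then return pairs such as a page number and the URL attached to a link on that page. Like the original snippet, this only finds links stored as URI actions on annotations; URLs that merely appear as visible text without an attached link would not be picked up.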