Extract hyperlinks from PDF in Python

Question

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.

For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.

Recommended answer

This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.

It is possible to get the hyperlinks using PDFMiner. The complication is (like with so much about PDFs), there is really no relationship between the link annotations and the text of the link, except that they are both located at the same region of the page.

Here is the code I used to get the links on a PDFPage:

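from pdfminer.pdftypes import resolve1
from pdfminer.psparser import LIT

annotationList = []
# page.annots may be None, a list, or an indirect reference; resolve1 handles all three.
for annotation in resolve1(page.annots) or []:
    annotationDict = resolve1(annotation)
    if annotationDict.get("Subtype") is not LIT("Link"):
        # Skip over any annotations that are not links
        continue
    position = resolve1(annotationDict["Rect"])
    uriDict = resolve1(annotationDict.get("A"))
    # Every link I have seen so far carried a /URI action; skip any that do not
    # (for example, internal links that jump to another page).
    if not uriDict or uriDict.get("S") is not LIT("URI"):
        continue
    uri = resolve1(uriDict["URI"])
    if isinstance(uri, bytes):
        # Under Python 3 the URI comes back as a byte string.
        uri = uri.decode("utf-8", "replace")
    # Some of my URIs have spaces.
    annotationList.append((position, uri.replace(" ", "%20")))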

Then I defined a function like:

def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    # No annotation rectangle overlaps this element.
    return None

which I used to search the annotationList I previously found on the page to see if any hyperlink occupies the same region as a LTTextBoxHorizontal that I was inspecting on the page.
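
To make the overlap test concrete, here is a minimal sketch of the same check run against made-up rectangles. The `Box` tuple is a hypothetical stand-in for a pdfminer layout element (only the `x0`, `y0`, `x1`, `y1` attributes are modelled), the coordinates and URL are invented for illustration, and `get_overlapping_link` is just a snake_case restatement of the function above.

```python
from collections import namedtuple

# Hypothetical stand-in for a pdfminer layout element; only the bounding-box
# attributes that the overlap test reads are modelled.
Box = namedtuple("Box", "x0 y0 x1 y1")


def get_overlapping_link(annotation_list, element):
    """Return the URL of the first annotation whose rectangle overlaps element."""
    for (x0, y0, x1, y1), url in annotation_list:
        if x0 > element.x1 or element.x0 > x1:
            continue  # no horizontal overlap
        if y0 > element.y1 or element.y0 > y1:
            continue  # no vertical overlap
        return url
    return None


# Invented coordinates in PDF points (origin at the bottom-left of the page).
annotations = [((72.0, 700.0, 200.0, 712.0), "http://example.com/a")]
on_link = Box(80.0, 701.0, 150.0, 711.0)     # sits inside the link rectangle
elsewhere = Box(80.0, 500.0, 150.0, 511.0)   # same columns, much lower on the page

link_hit = get_overlapping_link(annotations, on_link)
link_miss = get_overlapping_link(annotations, elsewhere)
```

Two rectangles are treated as overlapping unless one lies entirely to the side of, or entirely above or below, the other, so a text element that only partly covers a link rectangle still matches.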

In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked through all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions.
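
The line-by-line walk might look roughly like the sketch below. It assumes that iterating over a text box yields its lines (which is how pdfminer's layout containers behave, and avoids touching the private `_objs` attribute) and that each line exposes a bounding box and `get_text()`. `FakeLine`, `overlaps` and `links_by_line` are hypothetical names introduced for illustration, and the sample data is invented.

```python
class FakeLine:
    """Hypothetical stand-in for LTTextLineHorizontal: a bounding box plus get_text()."""

    def __init__(self, x0, y0, x1, y1, text):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self._text = text

    def get_text(self):
        return self._text


def overlaps(rect, element):
    """True unless the rectangle and the element are fully separated on some axis."""
    x0, y0, x1, y1 = rect
    if x0 > element.x1 or element.x0 > x1:
        return False
    if y0 > element.y1 or element.y0 > y1:
        return False
    return True


def links_by_line(text_boxes, annotation_list):
    """Yield (line text, url) for each line that overlaps a link rectangle."""
    for box in text_boxes:
        for line in box:  # assumed: iterating a text box yields its lines
            for rect, url in annotation_list:
                if overlaps(rect, line):
                    yield line.get_text().strip(), url
                    break  # report only the first matching link per line


# Invented data: one text box holding two lines, only the first under a link.
box = [
    FakeLine(72, 700, 300, 712, "Check this link out\n"),
    FakeLine(72, 686, 300, 698, "Unlinked text\n"),
]
annotations = [((72, 700, 180, 712), "http://example.com/a")]
pairs = list(links_by_line([box], annotations))
```

With real pages, the text boxes would come from pdfminer's layout analysis and the annotation list from the page's link annotations; matching at line granularity keeps a link from being attributed to an entire multi-line text box.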
