How to extract all links from a PDF file?

Problem description

According to the standard, links are stored in annotations (section 12.5.6.5 of the specification). It is easy to extract the address from there; see "Extracting links to pages in another PDF from PDF using Python or other method". Very often, however, links are not special objects in the document but appear as plain text such as "http://blah-blah.com". How do I extract links not only from annotations but also from the text itself? I can search the whole text for strings like "http://", but is there a better solution? PDF editors highlight text links too; how do they know that a piece of text is a hyperlink?

Recommended answer

Unfortunately, URLs that are not saved as annotations but are simply embedded in the content text are not marked in any special way in a PDF. There is no solution other than searching the complete text of the PDF and pattern matching for URLs.
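
The pattern-matching step could look roughly like the sketch below. It assumes the page text has already been extracted into a string with whatever PDF library you use; the names `URL_PATTERN`, `TRAILING` and `find_urls` are hypothetical, and the regular expression is a deliberately simple heuristic, not a full URL grammar. URLs that the PDF's layout splits across lines are not handled here.

```python
import re

# Heuristic: http(s):// or ftp:// URLs, plus bare "www." hosts.
# This is an assumption-laden simplification, not a complete URL grammar.
URL_PATTERN = re.compile(r"""(?:(?:https?|ftp)://|www\.)[^\s<>"']+""", re.IGNORECASE)

# Punctuation that usually belongs to the surrounding sentence, not the URL.
TRAILING = ".,;:!?)]}'\""


def find_urls(text):
    """Return URLs found in text, in order of first appearance, without duplicates."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that the pattern swallowed at the end.
        url = match.group(0).rstrip(TRAILING)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Stand-in for text extracted from a PDF page.
    sample = 'See "http://blah-blah.com". Also www.example.org/docs, and (https://example.com/a?b=1).'
    for url in find_urls(sample):
        print(url)
```

Stripping trailing punctuation is a trade-off: it avoids capturing the period that ends a sentence, but it will also truncate the rare URL that legitimately ends in one of those characters. PDF viewers that highlight plain-text links presumably apply a similar heuristic to the rendered text.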
