Rendering a PDF as an image and extracting hyperlinks

Question

I use ImageMagick to render a PDF (generated by pdfLaTeX) as an image:

convert -density 120 test.pdf -trim test.png

Then I use this image in an HTML file (in order to include LaTeX code in my own wiki engine).

But of course, the PNG file has none of the hyperlinks that the PDF file contains.

Is there any way to also extract the coordinates and target URLs of the hyperlinks, so that I can build an HTML image map?

If it makes a difference: I only need external (http://) hyperlinks, not PDF-internal ones. A text-based solution like pdftohtml would be unacceptable, since the PDFs also contain graphics and formulas.

Recommended answer

ImageMagick uses Ghostscript to render the PDF file to an image. You could also use Ghostscript to extract the Link annotations. In fact, the PDF interpreter already does this for the benefit of the pdfwrite device, so that it can produce PDF files with the same hyperlinks as the original.
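Since the rendering is done by Ghostscript anyway, it can also be called directly. The following Python sketch only builds such a command line; it is an illustration, not part of the original answer. The function name `ghostscript_render_command` and the choice of the `png16m` device are assumptions, and the PDF is assumed to have a single page. The `-r120` option corresponds to the `-density 120` used with `convert`; no trimming is applied here.

```python
import subprocess


def ghostscript_render_command(pdf_path, png_path, dpi=120):
    """Build a Ghostscript command line that rasterises a PDF to a PNG.

    Hypothetical helper for illustration; assumes a single-page PDF.
    """
    return [
        "gs",
        "-dNOPAUSE",                 # do not pause after each page
        "-dBATCH",                   # exit when the input has been processed
        "-sDEVICE=png16m",           # 24-bit RGB PNG output (assumed choice)
        f"-r{dpi}",                  # resolution in dots per inch
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


if __name__ == "__main__":
    cmd = ghostscript_render_command("test.pdf", "test.png")
    print(" ".join(cmd))
    # Uncomment to actually run it (requires Ghostscript on the PATH):
    # subprocess.run(cmd, check=True)
```

Further options can be appended to the returned list before the input file name if needed.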

You would need to do a small amount of PostScript programming; let me know if you want more details.

In gs/Resource/Init, the file pdf_main.ps contains large parts of the PDF interpreter. There you will find this:

  /Link {
    mark exch
    dup /BS knownoget { << exch { oforce } forall >> /BS exch 3 -1 roll } if
    dup /F knownoget { /F exch 3 -1 roll } if
    dup /C knownoget { /Color exch 3 -1 roll } if
    dup /Rect knownoget { /Rect exch 3 -1 roll } if
    dup /Border knownoget {
....
    } if
    { linkdest } stopped 

That code processes Link annotations (the hyperlinks in the PDF file). You could replace 'linkdest' with PostScript code that writes the data to a file instead, which would give you the hyperlinks. Note that you would also need to set -dDOPDFMARKS on the command line, because this kind of processing is normally disabled for rendering devices, which cannot make use of it.
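Once the link rectangles and URLs have been written to a file, they still have to be mapped onto the rendered image. The Python sketch below shows one way this might be done; it is not part of the original answer. It assumes that each link is available as a `(rect, url)` pair with `rect` in PDF points (origin at the bottom-left of the page), that the page is unrotated with its lower-left corner at (0, 0), and that the page height in points is known. The names `rect_to_pixels`, `build_image_map` and `trim_offset` are hypothetical. Because `-trim` crops the image, the number of pixels removed from the left and top must be determined separately and passed in as `trim_offset`.

```python
from html import escape


def rect_to_pixels(rect, page_height_pt, dpi=120, trim_offset=(0, 0)):
    """Convert a PDF /Rect (points, origin bottom-left) to pixel
    coordinates (origin top-left) as (left, top, right, bottom)."""
    x0, y0, x1, y1 = rect
    left, right = sorted((x0, x1))
    bottom, top = sorted((y0, y1))
    off_x, off_y = trim_offset  # pixels cropped from the left and top

    def px(points):
        # 72 PDF points per inch, dpi pixels per inch
        return round(points * dpi / 72.0)

    return (
        px(left) - off_x,
        px(page_height_pt - top) - off_y,     # flip the y axis
        px(right) - off_x,
        px(page_height_pt - bottom) - off_y,
    )


def build_image_map(links, page_height_pt, image_src, dpi=120,
                    trim_offset=(0, 0), map_name="pdflinks"):
    """Return HTML for an <img> plus a <map> with one <area> per external link."""
    areas = []
    for rect, url in links:
        # Keep only external links, as the question asks for.
        if not url.startswith(("http://", "https://")):
            continue
        coords = rect_to_pixels(rect, page_height_pt, dpi, trim_offset)
        areas.append(
            '  <area shape="rect" coords="{}" href="{}" alt="{}">'.format(
                ",".join(str(c) for c in coords),
                escape(url, quote=True),
                escape(url, quote=True),
            )
        )
    return "\n".join(
        ['<img src="{}" usemap="#{}">'.format(escape(image_src, quote=True), map_name),
         '<map name="{}">'.format(map_name)]
        + areas
        + ["</map>"]
    )


if __name__ == "__main__":
    # Made-up example data: one external link on a US Letter page (792 pt high).
    links = [((72, 700, 144, 712), "http://example.com/")]
    print(build_image_map(links, 792, "test.png"))
```

The y axis is flipped because PDF coordinates grow upwards while image coordinates grow downwards; the scale factor follows from 72 points per inch in PDF and the 120 dpi used for rendering.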
