Render a PDF as an image and extract hyperlinks


Question

I use ImageMagick to render a PDF (generated by pdfLaTeX) as an image:

convert -density 120 test.pdf -trim test.png

I then use this image in an HTML file (in order to include LaTeX code in my own wiki engine).

Of course, the PNG file has none of the hyperlinks that the PDF file contains.

Is it possible to also extract the coordinates and target URLs of the hyperlinks, so that I can build an HTML image map?

In case it makes a difference: I only need external (http://) hyperlinks, not PDF-internal ones. A text-based solution like pdftohtml would be unacceptable, since the PDFs also contain graphics and formulas.

Recommended answer

ImageMagick uses Ghostscript to render the PDF file to an image. You could also use Ghostscript to extract the Link annotations. In fact, the PDF interpreter already does this for the benefit of the pdfwrite device, so that it can produce PDF files with the same hyperlinks as the original.

You would need to do a small amount of PostScript programming; let me know if you want more details.

In gs/Resource/Init, the file pdf_main.ps contains large parts of the PDF interpreter. In there you will find this:

  /Link {
    mark exch
    dup /BS knownoget { << exch { oforce } forall >> /BS exch 3 -1 roll } if
    dup /F knownoget { /F exch 3 -1 roll } if
    dup /C knownoget { /Color exch 3 -1 roll } if
    dup /Rect knownoget { /Rect exch 3 -1 roll } if
    dup /Border knownoget {
....
    } if
    { linkdest } stopped 

That code processes Link annotations (the hyperlinks in the PDF file). You could replace 'linkdest' with PostScript code that writes the data to a file instead, which would give you the hyperlinks. Note that you would also need to set -dDOPDFMARKS on the command line, because this kind of processing is usually disabled for rendering devices, which cannot make use of it.
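Once the link rectangles and URLs have been written out, they still have to be mapped onto the rendered PNG. The following is a minimal Python sketch of that step, not part of the answer above. It assumes each link is available as a (rect, url) pair with rect given in default PDF user space (points, 72 per inch, origin at the bottom-left of the page), that the page height in points is known, and that the image was rendered at the same density as in the question (120). The function names and the trim_x / trim_y parameters are hypothetical: because -trim crops the image, the pixel offsets it removed at the left and top would have to be determined separately and passed in.

```python
from html import escape


def rect_to_area(rect, url, page_height_pt, dpi=120, trim_x=0, trim_y=0):
    """Turn one PDF /Rect (in points) into an HTML <area> element.

    trim_x / trim_y are hypothetical parameters: the number of pixels
    cropped from the left and top edges of the rendered image.
    """
    x0, y0, x1, y1 = rect

    def to_px(value_pt):
        # Default PDF user space has 72 units per inch.
        return round(value_pt * dpi / 72.0)

    left = to_px(min(x0, x1)) - trim_x
    right = to_px(max(x0, x1)) - trim_x
    # PDF y grows upwards, image y grows downwards, so flip the axis.
    top = to_px(page_height_pt - max(y0, y1)) - trim_y
    bottom = to_px(page_height_pt - min(y0, y1)) - trim_y

    return '<area shape="rect" coords="%d,%d,%d,%d" href="%s">' % (
        left, top, right, bottom, escape(url, quote=True))


def build_image_map(links, page_height_pt, name="pdflinks", **options):
    """Build a <map> element from (rect, url) pairs, keeping external links only."""
    areas = [
        rect_to_area(rect, url, page_height_pt, **options)
        for rect, url in links
        if url.startswith(("http://", "https://"))
    ]
    return '<map name="%s">\n%s\n</map>' % (name, "\n".join(areas))


if __name__ == "__main__":
    # Example data only: one external link on a US Letter page (792 pt high).
    example_links = [((72, 720, 144, 732), "http://example.com/")]
    print(build_image_map(example_links, page_height_pt=792))
```

The resulting map would then be attached to the image with usemap="#pdflinks" on the img element. If the PDF pages use a rotation or a non-default user unit, the simple scale-and-flip above would not be sufficient.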
