Batch convert and crop PostScript to PDF


Problem description


I know barely enough to survive in this digital world.

I have many one-page PostScript files (graphs/images) that I wish to convert to PDF and automatically crop to a narrow box. I'm on Windows right now (I do use Linux too, so don't hesitate to post code for Linux).

I have in the past been successful by combining Ghostscript gswin32c.exe and Calibre pdfmanipulate.exe. This is probably a familiar approach to many here.

But this approach has become fraught with problems, for several reasons.

One problem arose after I "upgraded" to the 64 bit gswin64c.exe. The 32 bit version gswin32c.exe still works on my system though, so I can't complain too much.

Another problem arose while dealing with postscript files that are perhaps improperly coded. There seem to be at least two problems, but I'm not sure which, if any, is responsible or if both are. One problem is that the bounding box line, e.g. %%BoundingBox: 135 179 484 587, is not always placed on the second line from the top. I understand that can be an issue. Another problem is that the bounding box above corresponds to a "Portrait" orientation in Ghostscript, but the cropping follows the "Landscape" orientation. Yet another problem, which I have not identified, is that for some files the cropping seems quite random.

So here is my 32bit approach (which works for high quality files), followed by the 64bit adaptation which doesn't work (perhaps because it calls some pypdf script on my machine rather than the patched script provided by calibre, if I understand https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/800551 and http://www.mobileread.com/forums/archive/index.php/t-103097.html, but I'm just guessing and don't know a workaround anyhow):

@echo off
echo batch processing with Latex ps2pdf followed by Ghostscript gswin64c.exe and Calibre2 pdfmanipulate.exe
for %%I in (*.ps,*.eps) do (
"C:\Program Files\MiKTeX 2.9\miktex\bin\x64\ps2pdf" %%I
)
for %%I in (*.pdf) do (
"C:\Program Files (x86)\Ghostscript\gs9.00\bin\gswin32c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped32.pdf" -b bounding "%%I"
pause
"C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped64.pdf" -b bounding "%%I"
pause
)

The above 32 bit approach works on high quality files, e.g. PostScript level 3 produced by PSTricks or by Maple's standard 2D plot driver, but not on older files, e.g. PostScript level 2 (if that) produced by Maple's classic plot driver.

I have found a workaround for some such files. It consists in using epstopdf from the (MiKTeX) LaTeX distribution. It works on those Maple classic files. Unfortunately it doesn't work on some other postscript files I generated several years ago with PSTricks and other software like Matlab.

And so I need to make several transformations and select the ones that worked. I wonder if you would have suggestions that would make my life easier. If I can fix the BoundingBox and Portrait/Landscape issues I should be quite content.

I thank you in advance for any suggestions. A linux suggestion would be acceptable. My preference will go for a solution that might be able to handle the diversity of files in one single push of the "return" key.

And of course I'm looking for a lossless type of cropping, one that consists only in interpreting the bounding box, but not in transforming it into a (possibly) lower quality pdf.

EDIT: I forgot to say. When I apply gswin32c/pdfmanipulate to a high quality level 3 postscript file, the file named "bounding" fills with information like:

%%BoundingBox: 34 128 567 667
%%HiResBoundingBox: 34.364390 128.875004 566.054069 666.071980

In the example above, the file was already pretty much cropped. Note the closeness between %%BoundingBox and %%HiResBoundingBox

But applied to a low quality level 2 (or so it claims to be) postscript file, the "bounding" file fills with:

%%BoundingBox: 189 137 574 467
%%HiResBoundingBox: 189.485994 137.843996 573.299983 466.668478

but the bounding box really ought to be

%%BoundingBox: 135 179 484 587

The above (135 179 484 587) is the bounding box provided by the postscript file itself (which I moved to the second line by copy-pasting) and it is consistent with the bounding box interpreted by Ghostview/Ghostscript when in the Portrait orientation.

But it gets completely ignored by Ghostscript...

I don't know where the 189 137 574 467 comes from --- it's very wrong...

EDIT 2. I'd like to clarify a few points, in response to Ken's questions:

Hi Ken, thanks for your reply,

sorry if my question was unclear --- nevertheless you seem to have understood the gist of it --- let me take your questions in turn:

> I'm unsure why you are using 2 applications, it should be possible to perform the entire transformation with just Ghostscript.

I didn't find a way to do it all with Ghostscript so I used another way. I found the Ghostscript/Calibre suggestion here, http://www.mobileread.com/forums/archive/index.php/t-72885.html, and elsewhere, tried it and it worked until recently.

I'm not saying it's not possible to do it all with Ghostscript, I'm merely saying that I didn't find a way to.

> "One problem arose after I "upgraded" to the 64 bit gswin64c.exe" You haven't said what the problem was. Have you reported it as a bug? If people don't report bugs, they don't get fixed...

I gave the links describing the problem and the bug report, here: https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/800551, http://www.mobileread.com/forums/archive/index.php/t-103097.html, my problem is the exact same one.

> You seem to have some confusion between PostScript programs and comments. Any line in a PostScript program beginning '%' is a comment, and has no effect on the operation of the program. So BoundingBox comments won't do anything at all.

I beg to differ, if I may. Take a postscript file, remove the %%BoundingBox, save and open it in Ghostview. Ghostview throws up error messages and then displays it without using the bounding box information, e.g. a figure surrounded by a lot of white space instead of tightly surrounded by the bounding box. So yes, this comment does something, within Ghostview at least. Having removed the %%BoundingBox, if you then use Calibre/pdfmanipulate to crop the pdf, it will crop it wrongly in cases where having the %%BoundingBox would have worked. So this "comment" is quite useful in the context of displaying and cropping.

> Note there is no requirement for it to be the second line of the file...

It is recommended by Adobe. Quoting from Adobe,

"The second required DSC header comment provides information about the size of the EPS file and must be present so the including application can transform and clip the EPS file properly. This is the bounding box comment."

http://partners.adobe.com/public/developer/en/ps/5002.EPSF_Spec.pdf

Adobe say "must." Personally I couldn't care less if it's a must or not, as long as I can produce pdf from my eps that are properly bounded.
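Since the position of the comment is the sticking point here, a minimal shell sketch of reading %%BoundingBox from wherever it sits in the file may help. The function name `find_bbox` is made up for illustration, and the handling of `(atend)` assumes the usual DSC convention that the real values then appear in the trailer.

```shell
#!/bin/sh
# Print the four %%BoundingBox numbers of a PostScript/EPS file, wherever the
# comment appears. A "%%BoundingBox: (atend)" placeholder is skipped so that
# the values given later in the file are used instead.
# Returns non-zero if no usable %%BoundingBox comment is found.
find_bbox() {
    awk '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        /^%%BoundingBox:/ {
            sub(/^%%BoundingBox:[ \t]*/, "")
            if ($1 == "(atend)") next           # real values come later
            print $1, $2, $3, $4
            found = 1
            exit
        }
        END { exit !found }
    ' "$1"
}
```

Calling `find_bbox classic.ps` would print the four numbers on one line, ready to be fed to whatever does the cropping.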

> In general Ghostscript ignores DSC comments, however if you set ProcessDSC to true, then it will make very limited use of it (primarily the BoundingBox comment to set the page size).

With pdfmanipulate it makes all the difference between a properly cropped pdf and an improperly cropped one.

> Moving on. You say you are using LaTeX ps2pdf; if you already have a PostScript file, you can send that to Ghostscript for conversion to PDF. It's not clear to me what exactly you are using Ghostscript for in this case, simply to find the real bounding box of the page?

Yes.

> It's not clear to me what you mean by 'lossless' cropping, if you crop the content you must be losing something clearly, even if it's just white space...

I mean that I don't want the cropping process to "rasterize" (or whatever it's called, you will know the term) the whole image. The part of the file that is cropped out is not useful to me so it's not much of a loss. The part of the file that is within the crop should be of the same quality as the original. That's the general idea.

You can find comments about this here, which is one place where I found useful information, http://www.charlietanksley.net/philtex/reading-pdfs-on-portables/

> It's easy enough to do the conversion in one pass if you know the size you want to crop to,

No, I don't know the size. That's why I'm going to such lengths to have software calculate it for me, and it's obviously not a simple thing because Ghostscript and epstopdf don't always agree on the optimal crop, one getting it right for some files but not for others, the other getting it right for other files but not for some...

> If you don't know the size then you can do it in 2 passes using only Ghostscript by first extracting the BoundingBox as you have done. That will get you 4 numbers, the bottom left and top right of the bounding box (if I remember correctly). You then create a 'translate' PostScript operation to move the content of the page down and left (so that it starts at 0,0, the bottom left corner). You also create a page device request to set the page size, the size being given by width = right - left and height = top - bottom. Feed the original file, along with the PostScript operators, to Ghostscript and select the pdfwrite device and you will get a PDF file.

A batch file example would be great, if you have one handy. I have seen several examples based on pdfwrite and none that I've tried have worked. The devil is in the detail.
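As a rough illustration of the arithmetic Ken describes (width = right - left, height = top - bottom, then a translate by the negated lower-left corner), here is a small shell sketch. The helper name `bbox_prologue` is hypothetical; it only prints the PostScript fragment and does not run Ghostscript.

```shell
#!/bin/sh
# Turn four bounding box numbers (llx lly urx ury) into a PostScript prologue:
# a page size request followed by a translate that moves the lower-left
# corner of the box to 0,0. The page size comes first because setpagedevice
# resets the graphics state, which would undo an earlier translate.
bbox_prologue() {
    awk -v llx="$1" -v lly="$2" -v urx="$3" -v ury="$4" 'BEGIN {
        printf "<< /PageSize [%g %g] >> setpagedevice\n", urx - llx, ury - lly
        printf "%g %g translate\n", 0 - llx, 0 - lly
    }'
}
```

For the box 135 179 484 587 this prints a 349 by 408 point page and `-135 -179 translate`. One way to use it would be something along the lines of `gs -sDEVICE=pdfwrite -o out.pdf -c "$(bbox_prologue 135 179 484 587)" -f classic.ps`, though a file that sets its own page size can override the request, which is what the setpagedevice redefinition in the answer below guards against.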

> As far as the bounding box goes, it may be a bug, or it may be that the file makes a mark, potentially using a white ink at the outside location. In this case the bounding box device will still regard it as part of the page content. You may be able to see that it isn't, but the device cannot. Consider if the page was first filled with a dark background, and the content outlined using white ink.

The files were all created with software such as Matlab, Maple, PSTricks and it's unlikely (but obviously not impossible) that there would be invisible white marks outside of the area given by the %%BoundingBox.

In many cases, the %%BoundingBox comment contains all the information that is needed and I'd like Ghostscript or Calibre or pdfwrite or whomever to use that information.
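For files that really are EPS (an EPSF header plus a %%BoundingBox comment), Ghostscript has a `-dEPSCrop` switch that crops the page to the file's own bounding box. The sketch below wraps that in a shell function; the function name and the `DRY_RUN` switch are illustrative assumptions, and it has not been checked against the plain .ps files discussed in this thread, where the switch may simply have no effect.

```shell
#!/bin/sh
# Convert an EPS file to PDF, asking Ghostscript to crop the page to the
# file's own %%BoundingBox via -dEPSCrop.
# Set DRY_RUN=1 to print the command instead of running it.
eps_to_cropped_pdf() {
    src=$1
    dst=${src%.*}.pdf
    set -- gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dEPSCrop \
        -sOutputFile="$dst" "$src"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Because pdfwrite re-emits the vector content rather than rasterizing it, this stays within the "lossless" requirement stated above.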

> I cannot offer a comprehensive solution without understanding more about what you want to do, and ideally seeing one or more of your problematic files.

That would be very easy, how can I post a postscript file for your viewing? It's 420 kilobytes.

Thanks Ken, let's hope we can find a workable solution.

EDIT 3. I have identified a big part of the problem.

My postscript file has the following bounding box, pretty close to an optimal crop: %%BoundingBox: 135 179 484 587

When I run Ghostscript gswin64c/gswin32c to compute the bounding box, viz

for %%I in (*.ps,*.eps) do ("C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding)

I get:

%%BoundingBox: 145 189 475 574
%%HiResBoundingBox: 145.331574 189.485994 474.155986 573.299983

When I run ps2pdf followed by Ghostscript gswin64c, i.e.

for %%I in (*.ps,*.eps) do ("C:\Program Files\MiKTeX 2.9\miktex\bin\x64\ps2pdf" %%I)
for %%I in (*.pdf) do ("C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding)

I get the following bounding box:

%%BoundingBox: 189 137 574 467
%%HiResBoundingBox: 189.395994 137.843996 573.299983 466.668478

So the problem is that the conversion from ps to pdf with ps2pdf introduces a change in the bounding box information which results in incorrect cropping. Replacing ps2pdf with something else, like epstopdf, solves the problem here. Of course there are other solutions. Particularly valuable are solutions involving Ghostscript only, as suggested by Ken and luser droog. Their very valuable suggestions (superior to my quick fix) are below. Something like this has worked:

for %%I in (*.eps,*.ps) do ("C:\Program Files\MiKTeX 2.9\miktex\bin\x64\epstopdf" %%I)
for %%I in (*.pdf) do (
"C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped.pdf" -b bounding "%%I"
)
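Note that in the commands above `-dAutoRotatePages=/None` is passed to the bbox run, whereas the rotation happens in the PostScript-to-PDF conversion. A possible alternative, sketched here for Linux, is to give that switch to the pdfwrite step itself. The function name and the `DRY_RUN` switch are assumptions for illustration, and this has not been tried on the files from the question.

```shell
#!/bin/sh
# Convert one PostScript file to PDF with pdfwrite, with automatic page
# rotation switched off so the page keeps its original orientation.
# Set DRY_RUN=1 to print the command instead of running it.
ps_to_pdf_norotate() {
    src=$1
    dst=${src%.*}.pdf
    set -- gs -q -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        -dAutoRotatePages=/None -sOutputFile="$dst" "$src"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

The same switch can usually be handed to the ps2pdf wrapper ahead of the file names, since the wrapper forwards its options to Ghostscript.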

Solution

Insufficient space in comments to add this so I'm afraid I'm posting yet another answer....

The reason the BoundingBox looks bogus for the PDF file is a feature of the PDF conversion process. By default it rotates pages until the majority of the text is horizontal; in the case of this file (and, I presume, other files with the same problem), this resulted in a rotation by 90 degrees clockwise.

This means of course that the bounding box rotates as well, and inspection of the values shows that this is what has happened. So the BoundingBox is correct for the rotated PDF file.
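The values can be checked with a little arithmetic. Under a 90 degree clockwise rotation, a point (x, y) on a page of width W lands at (y, W - x). The sketch below applies that to the box Ghostscript reported for the PostScript file, assuming a page width of 612 points (US Letter); the thread never states the page size, so that figure is an assumption, as is the function name.

```shell
#!/bin/sh
# Map a bounding box through a 90 degree clockwise page rotation.
# Arguments: page width, then llx lly urx ury of the unrotated box.
# A point (x, y) lands at (y, W - x), so the new lower-left corner is
# (lly, W - urx) and the new upper-right corner is (ury, W - llx).
rotate_bbox_cw() {
    awk -v w="$1" -v llx="$2" -v lly="$3" -v urx="$4" -v ury="$5" 'BEGIN {
        printf "%.2f %.2f %.2f %.2f\n", lly, w - urx, ury, w - llx
    }'
}
```

`rotate_bbox_cw 612 145.331574 189.485994 474.155986 573.299983` gives `189.49 137.84 573.30 466.67`, which agrees to two decimals with the box reported for the PDF produced by ps2pdf.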

Now, I supplied a couple of PostScript programs by private email, here's what I put:

1pass.ps

This reads the BoundingBox line from the source PostScript file, and uses it to set up the page size and offset. You pass in the name of the file to use by setting 'SourceFileName', e.g. with the file you provided:

gs -sDEVICE=pdfwrite -sSourceFileName=classic.ps -o out.pdf 1pass.ps

will produce a file called out.pdf which is the result of reading the BoundingBox, and converting to a PDF file with a page cropped to that size.

%!PS  

%% redefine setpagedevice to prevent changes by the PostScript program  
%% But keep a copy under a different name, so we can use it.  
/Oldsetpagedevice /setpagedevice load def  
/setpagedevice {pop} bind def  

(File to process is ) print SourceFileName ==  

/SourceFile SourceFileName (r) file def  
/BoxString 65535 string def  
/LLx 0 def  
/LLy 0 def  
/URx 0 def  
/URy 0 def  
/FoundBox false def  

/GetValues {  
  token {                   % read a PostScript token  
    /LLx exch def               % Assume it's a number for now  
    token {  
      /LLy exch def  
      token {  
        /URx exch def  
        token {  
          /URy exch def  
          pop                       % Get rid of any remaining string data  
          true              % return success code  
        }{  
          (Failed to read a number from the string) ==  
          false             % return failure code  
        } ifelse  
      }{  
        (Failed to read a number from the string) ==  
        false               % return failure code  
      } ifelse  
    }{  
      (Failed to read a number from the string) ==  
      false                 % return failure code  
    } ifelse  
  } {  
    (Failed to read a number from the string) ==  
    false                   % return failure code  
  } ifelse  
} bind def  

{  
  SourceFile BoxString readline {  
    (%%BoundingBox:) anchorsearch {  
      pop                           %% discard matching string  
      GetValues             %% extract BBox  
      /FoundBox exch def        %% Note success/failure  
      exit                  %% exit this loop  
    } {  
      pop                   %% discard string, no match  
    } ifelse  
  } {  
    (Failed to find a %%BoundingBox comment) ==  
    exit                            %% No more data, exit loop  
  } ifelse  
} loop  

SourceFile closefile            %% close the file  

FoundBox {  
  (LLx = ) print LLx ==  
  (LLy = ) print LLy ==  
  (URx = ) print URx ==  
  (URy = ) print URy ==  
  << /PageSize [URx LLx sub URy LLy sub] >> Oldsetpagedevice  
  LLx neg LLy neg translate  
  SourceFileName run  
} if  
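To cover the "one push of the return key" request on Linux, a loop along the following lines could run 1pass.ps over a whole directory. The function name and the `DRY_RUN` switch are made up for this sketch, and 1pass.ps is assumed to sit in the current directory.

```shell
#!/bin/sh
# Run 1pass.ps over every file given, writing NAME.pdf next to each source.
# Set DRY_RUN=1 to print the commands instead of running Ghostscript.
crop_all_1pass() {
    for src in "$@"; do
        # Skip unmatched glob patterns such as a literal "*.eps".
        [ -e "$src" ] || [ "${DRY_RUN:-0}" = 1 ] || continue
        dst=${src%.*}.pdf
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo gs -sDEVICE=pdfwrite "-sSourceFileName=$src" -o "$dst" 1pass.ps
        else
            gs -sDEVICE=pdfwrite "-sSourceFileName=$src" -o "$dst" 1pass.ps
        fi
    done
}
```

Usage would be `crop_all_1pass *.ps *.eps`. Two caveats: `a.ps` and `a.eps` would both write `a.pdf`, and on newer Ghostscript releases where SAFER mode is the default, the `file` and `run` calls in 1pass.ps may be refused unless read access to the source file is permitted (check the documentation for your version).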

2pass.ps

This is intended to be used the way you are currently working. It has two advantages over 1pass.ps:

  1. It works with PDF files as well as PostScript files, and with files which do not contain a %%BoundingBox comment.
  2. The BoundingBox is accurate.

It has the disadvantage that you have to process each file twice, once to get the bounding box and once to create the PDF file.

This takes two parameters, the name of the file containing the output of the bbox device, and the name of the file to be converted. Again, using the file you sent, you would use it like this:

First command:

  gs \
   -sDEVICE=bbox \
    classic.ps 2> bounding.txt

Second command:

  gs \
   -sDEVICE=pdfwrite \
   -sBoxFileName=bounding.txt \
   -sPostScriptFileName=classic.ps \
   -o out.pdf \
    2pass.ps
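The two commands can be chained per file. The following is only a sketch: the function name, the `.bbox.txt` suffix and the `DRY_RUN` switch are assumptions, and `-dBATCH -dNOPAUSE` are added to the bbox pass (as in the question's own batch file) so Ghostscript exits instead of waiting at its prompt.

```shell
#!/bin/sh
# Two-pass crop of one file: pass 1 writes the bbox device output to
# NAME.bbox.txt, pass 2 feeds that file and the source to 2pass.ps.
# Set DRY_RUN=1 to print both commands instead of running them.
crop_2pass() {
    src=$1
    dst=${src%.*}.pdf
    box=${src%.*}.bbox.txt
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "gs -sDEVICE=bbox -dBATCH -dNOPAUSE $src 2> $box"
        echo "gs -sDEVICE=pdfwrite -sBoxFileName=$box -sPostScriptFileName=$src -o $dst 2pass.ps"
        return 0
    fi
    # Only run the second pass if the first one succeeded.
    gs -sDEVICE=bbox -dBATCH -dNOPAUSE "$src" 2> "$box" &&
    gs -sDEVICE=pdfwrite "-sBoxFileName=$box" "-sPostScriptFileName=$src" -o "$dst" 2pass.ps
}
```

A whole directory could then be handled with `for f in *.ps *.eps; do [ -e "$f" ] && crop_2pass "$f"; done`. Since 2pass.ps expects %%BoundingBox on the first line of the box file, any warning Ghostscript prints to stderr ahead of it would need to be filtered out first.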

PostScript code for 2pass.ps:

%!PS  

%% redefine setpagedevice to prevent changes by the PostScript program  
%% But keep a copy under a different name, so we can use it.  
/Oldsetpagedevice /setpagedevice load def  
/setpagedevice {pop} bind def  

(Bounding Box parameters in file ) print BoxFileName ==  
(File to process is ) print PostScriptFileName ==  

/BoxFile BoxFileName (r) file def  
/BoxString 256 string def  
/HiResBoxString 256 string def  
/LLx 0 def  
/LLy 0 def  
/URx 0 def  
/URy 0 def  

BoxFile BoxString readline  % Read first line from file  
{  
  /BoxString exch def       % redefine string to be the one we read  
}{  
  (Encountered EOF before newline reading %%BoundingBox) == flush  
} ifelse  

BoxFile HiResBoxString readline % Read second line from file  
{  
  /HiResBoxString exch def      % redefine string to be the one we read  
}{  
  (Encountered EOF before newline reading %%HiResBoundingBox) == flush  
} ifelse  

BoxFile closefile               % close the file  

BoxString (%%BoundingBox:) anchorsearch  
{  
  pop                       % Get rid of the matching string  
  token {                   % read a PostScript token  
    /LLx exch def               % Assume it's a number  
    token {  
      /LLy exch def  
      token {  
        /URx exch def  
        token {  
          /URy exch def  
          pop                       % Get rid of any remaining string data  
        }{  
          (Failed to read a number from the string) ==  
        } ifelse  
      }{  
        (Failed to read a number from the string) ==  
      } ifelse  
    }{  
      (Failed to read a number from the string) ==  
    } ifelse  
  } {  
    (Failed to read a number from the string) ==  
  } ifelse  
}{  
  print (does not contain a BoundingBox) ==  
} ifelse  

(LLx = ) print LLx ==  
(LLy = ) print LLy ==  
(URx = ) print URx ==  
(URy = ) print URy ==  

<< /PageSize [URx LLx sub URy LLy sub] >> Oldsetpagedevice  
LLx neg LLy neg translate  

PostScriptFileName run  
