How can I tell the resolution of a scanned PDF from within a shell script?

Question

I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned at 200dpi, some at 300dpi, and some at 600dpi. Since DjVu is a pixel-based format, I want to be sure I use the same resolution in the target DjVu file as was used for the scan.

Does anyone know what program I can run, or how I can write a program, to determine what resolution was used to produce a scanned PDF? (The number of pixels might work too, as almost all documents are 8.5 by 11 inches.)

Clarification after responses: I'm aware of the difficulties highlighted by Breton, and I'm willing to concede that the problem in general is ill-posed, but I'm not asking about general PDF documents. My particular documents came out of a scanner. They contain one scanned image per page, with the same resolution on every page. If I convert the PDF to PostScript I can poke around by hand and find pixel dimensions easily; I could probably find image sizes with more work. And if in desperate need I could modify the dictionary stack that gs is using; long ago, I wrote an interpreter for PostScript Level 1.

All of that is what I'm trying to avoid.

Thanks to the help I received, I've posted an answer below:

  1. Extract the bounding box from the PDF using identify, taking only the output for the first page, and understanding that the units will be PostScript points, of which there are 72 to an inch.
  2. Extract the images from the first page using pdfimages.
  3. Get the height and width of each image. This time identify will give the number of pixels.
  4. Sum the areas of the images to get the total number of dots squared.
  5. To get the resolution, compute the area of the bounding box in inches squared, divide dots squared by inches squared, take the square root, and round to the nearest multiple of 10.
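
The steps above can be sketched roughly as follows. This is not the author's script, only a minimal illustration under some assumptions: ImageMagick's identify is allowed to read PDFs and reports the first page at its default 72 units per inch (so the numbers are points), pdfimages is the Poppler or Xpdf tool and writes files named with the given prefix, and mktemp -d is available. The function names estimate_dpi and pdf_dpi are made up for this example.

```shell
#!/bin/sh

# Step 5: turn a page size in PostScript points and a total pixel count
# into a resolution, rounded to the nearest multiple of 10.
#   $1, $2: page width and height in points (72 points per inch)
#   $3:     total image area in pixels ("dots squared")
estimate_dpi() {
    awk -v w="$1" -v h="$2" -v px="$3" 'BEGIN {
        sq_inches = (w / 72) * (h / 72)      # page area in square inches
        dpi = sqrt(px / sq_inches)           # dots per inch
        printf "%d\n", int(dpi / 10 + 0.5) * 10
    }'
}

# Steps 1 to 4: gather the inputs for estimate_dpi from a real PDF.
# Runs in a subshell so its variables and "set --" do not leak out.
# Assumes the first page contains at least one extractable image.
pdf_dpi() (
    pdf=$1
    tmp=$(mktemp -d) || exit 1

    # Step 1: first-page size; "[0]" selects the first page in ImageMagick.
    page=$(identify -format '%w %h\n' "${pdf}[0]") || { rm -rf "$tmp"; exit 1; }

    # Step 2: extract only the images on page 1 into the temp directory.
    pdfimages -f 1 -l 1 "$pdf" "$tmp/img" || { rm -rf "$tmp"; exit 1; }

    # Steps 3 and 4: pixel width and height of each image, summed as areas.
    pixels=$(identify -format '%w %h\n' "$tmp"/img-* |
        awk '{ s += $1 * $2 } END { printf "%d\n", s }')
    rm -rf "$tmp"

    # Split "width height" into positional parameters.
    set -- $page
    estimate_dpi "$1" "$2" "$pixels"
)
```

With a hypothetical file name, `pdf_dpi scan.pdf` would print one number such as 300. The arithmetic can be tried on its own: a letter-size page is 612 by 792 points, so `estimate_dpi 612 792 8415000` (a single 2550 by 3300 pixel image) prints 300.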

Full answer with script is below. I'm using it in live fire and it works great. Thanks to Harlequin for pdfimages and to Spiffeah for the alert about multiple images per page (it's rare, but I've found some).

Recommended answer

I guess that the scans are included as images in the PDF, so you could use pdfimages to extract them first. Then, identify should be able to find the correct data.
