Get the number of pages in a PDF document


Question

I have spent many hours searching for a fast and easy, but above all accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be known precisely before it is processed. The PDF documents come from many different clients, so they aren't generated with the same application and don't all use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Imagick requires a lot of installation and an Apache restart. When I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was the case with both the getNumberImages() and identifyImage() methods.

FPDI is easy to use and install (just extract the files and call a PHP script), BUT it does not support many compression techniques. In that case it returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression:

This opens the PDF file as a stream and searches for a string that contains the page count or something similar.

$f = "test1.pdf";
$stream = fopen($f, "r");
if (!$stream)
    return 0;

$content = fread($stream, filesize($f));
fclose($stream);
if (!$content)
    return 0;

$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";

// $matches[1] holds the captured numbers; take the largest one
if (preg_match_all($regex, $content, $matches))
    $count = max(array_map("intval", $matches[1]));

return $count;

  • /\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the parameter /Count inside, so most of the time it doesn't return anything. Source.
  • /\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages; it mostly matches some other data. Source.
  • /\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as a document can contain multiple values of /N, and most, if not all, of them are not the page count. Source.

See the answer below.
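To illustrate why the /Count pattern is fragile, here is a minimal sketch that runs it against two hand-written strings. Both strings and the function name countByRegex are made up for illustration and do not come from a real file. The first imitates a nested page tree, where /Count appears once per node. The second imitates a file whose page tree is stored inside a compressed object stream (possible since PDF 1.5), where the text /Count is not visible in the raw bytes at all.

```php
<?php
// Hypothetical helper: returns the largest /Count value found in raw PDF bytes,
// or 0 when the pattern does not appear.
function countByRegex(string $content): int
{
    if (!preg_match_all('/\/Count\s+(\d+)/', $content, $matches)) {
        return 0;
    }
    // $matches[1] holds the captured digit strings
    return max(array_map('intval', $matches[1]));
}

// Made-up fragment: a root page tree node with two child nodes.
$nested = "1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 13 >> endobj\n"
        . "2 0 obj << /Type /Pages /Kids [4 0 R] /Count 5 >> endobj\n"
        . "3 0 obj << /Type /Pages /Kids [5 0 R] /Count 8 >> endobj\n";

// Made-up stand-in for a file that keeps its page tree in a compressed stream.
$compressed = "7 0 obj << /Type /ObjStm /Filter /FlateDecode >> stream (binary data) endstream";

echo countByRegex($nested), "\n";     // 13
echo countByRegex($compressed), "\n"; // 0
```

Taking the maximum happens to work for the nested fragment, but it is only a heuristic: other dictionaries, such as the document outline, also carry a /Count entry, so the largest number is not guaranteed to be the page count.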

      Recommended answer

      A simple command line executable called pdfinfo.

      It is downloadable for Linux and Windows. You download a compressed file containing several small PDF-related programs. Extract it somewhere.

      One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of the data returned by running it on a PDF document:

      Title:          test1.pdf
      Author:         John Smith
      Creator:        PScript5.dll Version 5.2.2
      Producer:       Acrobat Distiller 9.2.0 (Windows)
      CreationDate:   01/09/13 19:46:57
      ModDate:        01/09/13 19:46:57
      Tagged:         yes
      Form:           none
      Pages:          13    <-- This is what we need
      Encrypted:      no
      Page size:      2384 x 3370 pts (A0)
      File size:      17569259 bytes
      Optimized:      yes
      PDF version:    1.6
      

      I haven't seen a PDF document for which it returned a wrong page count (yet). It is also really fast: even with big documents of 200+ MB the response time is just a few seconds or less.
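Since the report shown above is a series of "Key: value" lines, all of its fields can be collected at once, which may be handy when the page size is needed as well as the page count. The following is only a sketch: the function name parsePdfInfo is made up, and the sample lines are copied from the example output above rather than produced by a real run.

```php
<?php
// Hypothetical helper: turn pdfinfo's "Key: value" lines into an associative array.
function parsePdfInfo(array $lines): array
{
    $info = [];
    foreach ($lines as $line) {
        // Split on the first colon only, because values such as dates contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $info[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $info;
}

// Sample lines taken from the example output above.
$sample = [
    "Title:          test1.pdf",
    "CreationDate:   01/09/13 19:46:57",
    "Pages:          13",
    "Page size:      2384 x 3370 pts (A0)",
];

$info = parsePdfInfo($sample);
echo $info['Pages'], "\n";        // 13
echo $info['CreationDate'], "\n"; // 01/09/13 19:46:57
```

All values come back as strings, so the page count still needs an intval() before it is used as a number.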

      There is an easy way of extracting the page count from the output, here in PHP:

      // Make a function for convenience
      function getPDFPages($document)
      {
          // Keep the line that matches your platform
          $cmd = "/path/to/pdfinfo";              // Linux
          // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows (backslashes must be escaped)

          // Parse entire output
          // escapeshellarg() quotes the file name, so names with spaces are safe
          exec($cmd . " " . escapeshellarg($document), $output);

          // Iterate through lines
          $pagecount = 0;
          foreach($output as $op)
          {
              // Extract the number
              if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
              {
                  $pagecount = intval($matches[1]);
                  break;
              }
          }

          return $pagecount;
      }
      
      // Use the function
      echo getPDFPages("test 1.pdf");  // Output: 13
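
The function above returns 0 both when pdfinfo cannot be run and when no "Pages:" line is found, so a failure looks the same as an empty result. A possible variant, sketched below, uses the exit code that exec() reports through its third parameter. The name getPDFPagesOrNull and the $cmd parameter are made up for illustration, and the sketch assumes that pdfinfo signals failure with a non-zero exit code, as command line tools conventionally do.

```php
<?php
// Hypothetical variant: returns the page count, or null when the command fails
// or prints no "Pages:" line.
function getPDFPagesOrNull(string $document, string $cmd = "pdfinfo"): ?int
{
    $output = [];
    $exitCode = 0;
    // The third argument of exec() receives the exit status of the command.
    exec($cmd . " " . escapeshellarg($document), $output, $exitCode);

    if ($exitCode !== 0) {
        return null; // assumption: non-zero means pdfinfo could not read the file
    }

    foreach ($output as $line) {
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return null;
}

$pages = getPDFPagesOrNull("test 1.pdf");
echo $pages === null ? "could not read page count" : $pages, "\n";
```

Returning null instead of 0 lets the caller stop a job early when a client's file cannot be read, rather than processing it as a document with no pages.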
      

      Of course this command line tool can be used from any other language that can parse the output of an external program, but I use it in PHP.

      I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

      I hope this can help people, because I have spent a whole lot of time trying to find a solution to this, and I have seen a lot of questions about PDF page counts in which I didn't find the answer I was looking for. That's why I asked this question and answered it myself.
