Is there a PDF parser for PHP?

Question

Hi, I know about several PDF generators for PHP (FPDF, dompdf, etc.). What I want to know about is a parser.

For reasons beyond my control, certain information I need exists only in a table inside a PDF, and I need to extract that table and convert it to an array.

Any suggestions?

Recommended answer

I've written one before (for similar needs), and I can say this: have fun. It's quite a complex task. The PDF specification is large and unwieldy, and there are several ways of storing text inside a PDF. The kicker is that every PDF generator works differently. So while something like TFPDF or DOMPDF creates PDFs that are REALLY easy to read (from a machine standpoint), Acrobat makes some really hellish documents.

The reason is how each one writes text. Most DOM-based renderers (that I've used) write an entire line as one string and position it once, which is really easy to read. Acrobat tries to be more efficient (and it is) by writing only one or maybe a few characters at a time and positioning them independently. While this REALLY simplifies rendering, it makes reading MUCH more difficult.
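To make the difference concrete, here is a minimal sketch of pulling text out of a content stream that has already been decompressed. The function name `extractShownText` and both sample streams are made up for illustration. It only handles literal strings shown with the `Tj` and `TJ` operators, and it assumes no escaped parentheses, no hex strings, and no font remapping, all of which a real parser has to deal with.

```php
<?php
// Hypothetical helper: collect literal strings shown by the Tj and TJ
// operators in an already-decompressed content stream.
// Assumptions: no escaped parentheses, no hex strings, no remapped fonts.
function extractShownText(string $content): string
{
    $out = '';
    // Match either "(...) Tj" (one string) or "[ ... ] TJ" (array of pieces).
    preg_match_all('/\(([^()]*)\)\s*Tj|\[([^\]]*)\]\s*TJ/', $content, $ops, PREG_SET_ORDER);
    foreach ($ops as $op) {
        if (substr($op[0], -2) === 'TJ') {
            // TJ array: strings interleaved with kerning numbers; keep the strings.
            preg_match_all('/\(([^()]*)\)/', $op[2], $parts);
            $out .= implode('', $parts[1]);
        } else {
            $out .= $op[1];
        }
    }
    return $out;
}

// A whole line written as one string and positioned once.
$lineAtOnce = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET";

// The same line written as small pieces with individual adjustments.
$piecewise = "BT /F1 12 Tf 72 720 Td [(H) -20 (el) 15 (lo W) -5 (orld)] TJ ET";

$a = extractShownText($lineAtOnce);
$b = extractShownText($piecewise);
```

Both calls should yield the same text here, but only because the pieces happen to appear in reading order. When a generator positions each fragment separately, you generally also have to track the text position to rebuild lines and table cells.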

The upside is that the PDF format itself is really simple. You have "objects" that follow a regular syntax, and you link them together to generate the content. The specification does a good job of describing the file format. But real-world reading is going to take a bit of brain power...
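As a rough illustration of that regular syntax, the sketch below scans raw PDF bytes for indirect objects of the form `N G obj ... endobj`. The function name `findIndirectObjects` and the sample data are hypothetical. It assumes objects are not packed into compressed object streams and that the word `endobj` never occurs inside binary stream data, so treat it as a starting point rather than a parser.

```php
<?php
// Hypothetical helper: find indirect objects ("N G obj ... endobj").
// Assumptions: no compressed object streams, and "endobj" does not appear
// inside a stream's binary data.
function findIndirectObjects(string $pdf): array
{
    $objects = [];
    // The s modifier lets "." match newlines; the lazy quantifier stops at
    // the first "endobj" after each "obj".
    $pattern = '/(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b/s';
    if (preg_match_all($pattern, $pdf, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Key by object number; keep the generation number and raw body.
            $objects[(int) $m[1]] = [
                'generation' => (int) $m[2],
                'body'       => trim($m[3]),
            ];
        }
    }
    return $objects;
}

$sample = "%PDF-1.4\n"
        . "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        . "2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n";

$objects = findIndirectObjects($sample);
```

The `2 0 R` inside the first object is a reference to object 2, which is how objects link together. Resolving those references against the collected array is the next step toward walking from the catalog to the page contents.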

Some helpful pieces of advice that I had to learn the hard way if you're going to write it yourself:

  1. Adobe likes to remap fonts, so character 65 will likely not be A... You need to find the map object and deduce what it's doing based on which characters are in there. It is efficient, since a character that doesn't appear in the document for that font isn't included (which makes life difficult if you try to programmatically edit a PDF)...
  2. Write it as abstractly as possible. Write classes for each object type and each native type (strings, numbers, etc.), and let those classes parse for you. There will be a fair bit of repetition in there, but you'll save yourself in the end when you realize that you need to tweak something for only one specific type...
  3. Write for a specific version or two of the PDF spec, and enforce it. Check the version number, and if it's higher than you expect, bail... Don't try to "make it work". If you want to support newer versions, break out the specification and upgrade the parser from there. Don't try to trial-and-error your way up (it's not fun)...
  4. Good luck with compressed streams. I've found that you typically can't trust the length argument to verify what you are decompressing. Sometimes (for some generators) it works well... For others it's off by one or more bytes. I just attempt to inflate the stream if the filter matches, and then force the length...
  5. When testing lengths, don't rely on strlen. Use mb_strlen($string, '8bit'), which always counts bytes, even on setups where mbstring function overloading makes strlen count characters instead (and it tolerates byte sequences that are invalid in other charsets).
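Points 3 to 5 can be sketched roughly as follows. All three function names are hypothetical, the list of supported versions is an arbitrary choice for illustration, and the stream decoding assumes a FlateDecode filter with PHP's zlib extension available. This is one possible reading of the advice, not the code the answer's author used.

```php
<?php
// Hypothetical helper for point 3: read the "%PDF-x.y" header and bail on
// any version outside an explicit list (the default list is an assumption).
function assertSupportedVersion(string $pdf, array $supported = ['1.3', '1.4']): string
{
    if (!preg_match('/^%PDF-(\d\.\d)/', $pdf, $m)) {
        throw new RuntimeException('Missing PDF header');
    }
    if (!in_array($m[1], $supported, true)) {
        throw new RuntimeException("Unsupported PDF version {$m[1]}");
    }
    return $m[1];
}

// Hypothetical helper for point 4: inflate a FlateDecode stream without
// trusting the declared /Length. $raw is everything found between "stream"
// and "endstream"; the declared length is tried first, then the full span,
// then the span without trailing line breaks.
function inflateStream(string $raw, int $declaredLength): string
{
    $candidates = [substr($raw, 0, $declaredLength), $raw, rtrim($raw, "\r\n")];
    foreach ($candidates as $candidate) {
        // gzuncompress() reads zlib-wrapped data; @ hides the warning that a
        // failed attempt would otherwise raise.
        $data = @gzuncompress($candidate);
        if ($data !== false) {
            return $data;
        }
    }
    throw new RuntimeException('Could not inflate stream');
}

// Hypothetical helper for point 5: always measure in bytes, falling back to
// strlen() when the mbstring extension is not loaded.
function byteLength(string $s): int
{
    return function_exists('mb_strlen') ? mb_strlen($s, '8bit') : strlen($s);
}

$version  = assertSupportedVersion("%PDF-1.4\n");
$original = "BT /F1 12 Tf (Hello) Tj ET";
$raw      = gzcompress($original);

// Simulate a generator whose /Length is a few bytes short.
$decoded = inflateStream($raw, byteLength($raw) - 3);
```

Trying several candidate spans is a blunt way to "force the length". A stricter parser might instead log the mismatch so that badly behaved generators can be identified.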

Otherwise, best of luck...
