Parsing a PDF file using regular expressions in Python


Question

I am trying to parse some object elements from a PDF file using Python's re module. My goal is to parse each PDF object with a regular expression. An example of PDF objects is the following:

1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
>>
endobj
2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj
...

When I use "\d+\s\d+\sobj[\s,\S]*endobj" it does not work: it keeps matching until the last endobj is found. How can I modify the regular expression so that each object is parsed separately (in other words, the part from 1 0 obj to endobj)?

Recommended answer

If you use only regular expressions, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regular expressions cannot handle recursive structures, at least not with Python's re module.
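For the literal question, a non-greedy quantifier (`.*?` with `re.DOTALL`) stops at the first endobj instead of the last. The following is a minimal sketch, not a robust parser: the names `OBJ_RE` and `split_objects` are made up for illustration, and the pattern works on bytes on the assumption that the file is read in binary mode. The last lines show the kind of input that defeats it.

```python
import re

# Non-greedy match: stop at the first "endobj" after each "N G obj" header.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)

sample = b"""1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
>>
endobj
2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj
"""

def split_objects(data):
    """Return (object number, generation, body) for each indirect object found."""
    return [(int(m.group(1)), int(m.group(2)), m.group(3).strip())
            for m in OBJ_RE.finditer(data)]

objects = split_objects(sample)
for num, gen, body in objects:
    print(num, gen, body)

# A string literal that happens to contain "endobj" cuts the match short,
# which is why a regex alone is not enough for arbitrary files.
tricky = b"3 0 obj\n(the word endobj inside a string)\nendobj\n"
print(split_objects(tricky)[0][2])
```

The same truncation happens when a stream's raw data contains the bytes endobj, so this approach is only safe for files whose contents you control.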

A PDF file is a tree of objects and streams:

  • Dictionaries: << (name value)* >>
  • Lists: [ (value)* ]
  • Names: / (regular char)*
  • Strings: ( (char)* )
  • Hex strings: < (hexchar)* >
  • Numbers: (-)? ((digit)+ | (digit)+ . (digit)* | . (digit)+)
  • Booleans: true | false
  • References: (digit)+ (whitespace)+ (digit)+ (whitespace)+ R
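Because dictionaries and lists nest, the grammar above maps naturally onto a small recursive-descent parser, with regular expressions used only for the flat tokens. The sketch below is illustrative and makes several simplifying assumptions: all function names are hypothetical, `\s` stands in for PDF whitespace, names keep their leading slash, references become `("ref", number, generation)` tuples, and escape sequences inside strings are left undecoded.

```python
import re

SKIP = re.compile(rb"(?:\s+|%[^\r\n]*)*")   # whitespace and comments
TOKENS = (
    ("ref", re.compile(rb"(\d+)\s+(\d+)\s+R\b")),   # must be tried before "num"
    ("num", re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")),
    ("name", re.compile(rb"/[^\s()<>\[\]{}/%]*")),
    ("hex", re.compile(rb"<([0-9A-Fa-f\s]*)>")),
    ("word", re.compile(rb"true|false|null")),
)

def parse_value(data, pos=0):
    """Parse one PDF value starting at pos; return (value, next position)."""
    pos = SKIP.match(data, pos).end()
    if data.startswith(b"<<", pos):
        return parse_dict(data, pos + 2)
    if data.startswith(b"[", pos):
        return parse_list(data, pos + 1)
    if data.startswith(b"(", pos):
        return parse_string(data, pos + 1)
    for kind, regex in TOKENS:
        m = regex.match(data, pos)
        if m:
            return convert(kind, m), m.end()
    raise ValueError(f"unexpected data at offset {pos}")

def convert(kind, m):
    """Turn a matched flat token into a Python value."""
    if kind == "ref":
        return ("ref", int(m.group(1)), int(m.group(2)))
    if kind == "num":
        text = m.group().decode("ascii")
        return float(text) if "." in text else int(text)
    if kind == "name":
        return m.group().decode("latin-1")
    if kind == "hex":
        digits = b"".join(m.group(1).split())
        digits += b"0" * (len(digits) % 2)   # an odd final digit is padded with 0
        return bytes.fromhex(digits.decode("ascii"))
    return {b"true": True, b"false": False, b"null": None}[m.group()]

def parse_dict(data, pos):
    """Read key/value pairs until the closing >>; pos is just after <<."""
    result = {}
    while True:
        pos = SKIP.match(data, pos).end()
        if data.startswith(b">>", pos):
            return result, pos + 2
        key, pos = parse_value(data, pos)
        value, pos = parse_value(data, pos)
        result[key] = value

def parse_list(data, pos):
    """Read values until the closing ]; pos is just after [."""
    items = []
    while True:
        pos = SKIP.match(data, pos).end()
        if data.startswith(b"]", pos):
            return items, pos + 1
        item, pos = parse_value(data, pos)
        items.append(item)

def parse_string(data, pos):
    """Read a literal string; pos is just after the opening parenthesis."""
    depth, out = 1, bytearray()
    while pos < len(data):
        ch = data[pos:pos + 1]
        if ch == b"\\":                      # keep escape sequences undecoded
            out += data[pos:pos + 2]
            pos += 2
            continue
        if ch == b"(":
            depth += 1
        elif ch == b")":
            depth -= 1
            if depth == 0:
                return bytes(out), pos + 1
        out += ch
        pos += 1
    raise ValueError("unterminated string")

sample = b"""<<
    /Type /Pages      % a comment
    /Kids [ 3 0 R ]
    /Count 1
    /Info << /Title (Nested (parens) here) /Scale -.5 >>
>>"""
value, end = parse_value(sample)
print(value)
```

The recursion in `parse_dict` and `parse_list`, and the depth counter in `parse_string`, are exactly the parts that a single regular expression cannot express.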

Whitespace and comments are ignored in most places. Comments start with % and run until the end of the line.

Indirect objects are specified as:

1 0 obj
(any object)
endobj

This object can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached:

1 0 obj
<<
/Length 22
>>
stream
(22 bytes of raw data)
endstream
endobj
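Since the stream data is raw bytes, it should be read by length rather than by searching for endstream. Here is a minimal sketch of that idea, assuming /Length is a direct integer in the dictionary (it can also be an indirect reference, which this sketch does not resolve). The helper name `read_stream` and the sample payload are invented for illustration.

```python
import re

STREAM_RE = re.compile(rb"stream\r?\n")
# A direct integer only: "/Length 5 0 R" (an indirect reference) is rejected.
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)")

def read_stream(obj_body):
    """Return the raw stream bytes of one indirect object, or None if it has no stream."""
    start = STREAM_RE.search(obj_body)
    if start is None:
        return None
    # Look for /Length only in the dictionary, i.e. before the stream keyword.
    length = LENGTH_RE.search(obj_body, 0, start.start())
    if length is None:
        raise ValueError("no direct /Length found (it may be an indirect reference)")
    begin = start.end()
    return obj_body[begin:begin + int(length.group(1))]

# The 22-byte payload deliberately contains the text "endobj".
data = (b"1 0 obj\n<<\n/Length 22\n>>\nstream\n"
        b"BT (endobj) Tj ET 1234"
        b"\nendstream\nendobj\n")
print(read_stream(data))
```

Reading exactly /Length bytes means keywords that appear inside the payload do not confuse the reader.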

A PDF file looks something like this:

%PDF-1.4
%ÿÿÿÿ
1 0 obj
<< /Author (MizardX) >>
endobj
2 0 obj
<<
/Type /Catalog
% more required keys
>>
endobj
% many more indirect objects, one after another
xref
0 3
0000000000 65535 f
0000000015 00000 n
0000000054 00000 n
trailer
<<
/Info 1 0 R
/Root 2 0 R
% ... more required keys
>>
startxref
225
%%EOF

The root of the object tree is the trailer dictionary. Every object is referenced directly or indirectly from this dictionary.
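A reader therefore usually starts from the end of the file: it finds startxref, then the trailer, then follows /Root. The sketch below illustrates this under loud assumptions: the trailer dictionary is flat (no nested dictionaries before /Root), the file uses a classic trailer keyword rather than a cross-reference stream, and `find_entry_points` is a hypothetical helper name. The sample bytes reuse values from the example above, so the offset is illustrative only.

```python
import re

STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
TRAILER_RE = re.compile(rb"trailer\s*<<(.*?)>>", re.DOTALL)  # assumes a flat dictionary
ROOT_RE = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R\b")

def find_entry_points(data):
    """Return (xref offset, (root object number, generation)) for a PDF byte string."""
    # startxref sits near the end of the file, so only the tail is searched.
    matches = list(STARTXREF_RE.finditer(data[-1024:]))
    if not matches:
        raise ValueError("startxref not found")
    xref_offset = int(matches[-1].group(1))

    trailers = TRAILER_RE.findall(data)
    if not trailers:
        raise ValueError("trailer not found")
    root = ROOT_RE.search(trailers[-1])     # the last trailer is the current one
    if root is None:
        raise ValueError("/Root missing from trailer")
    return xref_offset, (int(root.group(1)), int(root.group(2)))

sample = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Author (MizardX) >>\nendobj\n"
          b"2 0 obj\n<< /Type /Catalog >>\nendobj\n"
          b"trailer\n<<\n/Info 1 0 R\n/Root 2 0 R\n>>\n"
          b"startxref\n225\n%%EOF\n")
print(find_entry_points(sample))
```

In a real reader the returned offset would be used to load the cross-reference table, which gives the byte position of each indirect object without scanning the whole file.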

There is a lot more complexity hidden inside the streams, but that does not affect the file structure.

The full specification is available on the Adobe website.
