Encoding of PDF text strings


Question

I am working on a parser for PDF text extraction.

When a page's content stream needs to be Flate decoded (zlib compression), my code is able to decompress it, and the output (stream object) looks something like this:

BT
56.8 721.3 Td 
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET

I am interested in the string array (the operand of TJ).

The array seems to contain multiple hex-encoded strings, but the corresponding hex values do not make sense. Instead they look like a running sequence (01, 02, 03, ...), as if some sort of LZ77 compression had been applied.

  • Does PDF have multiple levels of compression?
  • How do I get plain text from the string array above?

Recommended answer

Abhishek,

This is far from an easy question, and unfortunately it shows that you have not read the PDF specification. You should do so.

You can download the Acrobat SDK here: http://www.adobe.com/devnet/acrobat/sdk/eula.html

Part of that is the PDF Specification, a very hefty document that explains the ins and outs of PDF (including the answer to your question).

In short, and not as a substitute for reading the documentation: what you are looking at are character codes in the encoding of the font selected by the /F2 12 Tf operator, which sets the font used for any text written afterwards.
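To make that concrete, here is a minimal Python sketch of how the TJ operand from the question could be turned into text once the font's code-to-character mapping is known. It rests on several assumptions: the names `parse_tj_array`, `decode_text` and `ASSUMED_F2_MAP` are hypothetical, the font is assumed to use single-byte codes (the odd-length strings such as <0506070809> suggest this), and the mapping itself is invented for illustration. In a real file the mapping would have to come from the /ToUnicode CMap or /Encoding entry of the font referenced by /F2.

```python
import re

# Hypothetical code-to-character table for font /F2. The letters are
# placeholders invented for this sketch; a real parser would build this
# table from the font's /ToUnicode CMap or /Encoding entry.
ASSUMED_F2_MAP = {code: chr(ord("A") + code - 1) for code in range(0x01, 0x0C)}

# Matches either a hex string <...> or a number (a positioning adjustment).
_TJ_TOKEN = re.compile(r"<([0-9A-Fa-f\s]*)>|([-+]?(?:\d+\.?\d*|\.\d+))")


def parse_tj_array(operand: str) -> list:
    """Split a TJ operand into byte strings and numeric adjustments.

    Only hex strings and numbers are handled; literal strings in
    parentheses are deliberately left out of this sketch.
    """
    items = []
    for match in _TJ_TOKEN.finditer(operand):
        hex_digits, number = match.groups()
        if hex_digits is not None:
            digits = re.sub(r"\s+", "", hex_digits)
            if len(digits) % 2:
                # An odd number of hex digits is padded with a trailing 0.
                digits += "0"
            items.append(bytes.fromhex(digits))
        else:
            items.append(float(number))
    return items


def decode_text(items: list, code_map: dict) -> str:
    """Map each single-byte character code to text, ignoring adjustments."""
    chars = []
    for item in items:
        if isinstance(item, bytes):
            # Iterating over bytes yields integer codes; unknown codes
            # become U+FFFD so that gaps in the table stay visible.
            chars.extend(code_map.get(code, "\ufffd") for code in item)
    return "".join(chars)


operand = "[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]"
items = parse_tj_array(operand)
print(decode_text(items, ASSUMED_F2_MAP))  # ABCDECDEFGHIJK (placeholder letters)
```

The numbers between the strings are glyph positioning adjustments, not text, so the sketch skips them. A fuller extractor might treat large negative values as word spaces, and would need two-byte handling for composite fonts.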
