/Differences dictionary for encode parsing issue in PDF
Problem description
A Type 1 font's /Differences encoding maps character codes to glyph-name strings; for example, the character 1 is encoded as 'one'. This is used for digits and special characters only.
What is the standard way to use these encodings?
How should I decode a string from a PDF that uses such an encoding?
Link for the file: http://www.filedropper.com/open
Here's the /Differences array in your file (and honestly, you should have just posted this and not a link to a skeevy download page):
/Differences [
24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39 /quotesingle
96 /grave
128 /bullet/dagger/daggerdbl/ellipsis...
]
The way this works is that the font also has an encoding associated with it (for example /MacRomanEncoding or /WinAnsiEncoding). In the case of a Type 1 font, there is an encoding built into the font. Then, given a copy of that encoding, you apply the differences to it. Each number starts a run: beginning at that code (your first is 24), the names that follow replace consecutive entries, so entries 24-31 inclusive become /breve, /caron, /circumflex and so on.
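The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name apply_differences is made up for illustration, the base encoding is a placeholder filled with .notdef rather than the font's real built-in encoding, and the /Differences array is assumed to be already parsed into a flat list of integers and name strings. Only the four names visible before the truncation at code 128 are included.

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding with a parsed /Differences list applied.

    `differences` is a flat list: an int sets the current code, and each
    following glyph name is stored at that code, which then advances by one.
    """
    encoding = list(base_encoding)  # work on a copy, leave the base untouched
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # a number starts a new run
        else:
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            if not 0 <= code <= 255:
                raise ValueError(f"code {code} is outside 0..255")
            encoding[code] = item
            code += 1
    return encoding


# Placeholder base encoding (assumption): a real one comes from the font
# or from a named encoding, not from 256 copies of .notdef.
base = [".notdef"] * 256

# The array from the question, as far as it is shown.
differences = [
    24, "breve", "caron", "circumflex", "dotaccent",
        "hungarumlaut", "ogonek", "ring", "tilde",
    39, "quotesingle",
    96, "grave",
    128, "bullet", "dagger", "daggerdbl", "ellipsis",
]

encoding = apply_differences(base, differences)
print(encoding[24], encoding[26], encoding[31])
```

Copying the base encoding first matters because the same base may be shared by several fonts, each with its own differences.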
In Type 1 fonts, there is a dictionary called /CharStrings, which associates the name of each glyph with the data/code that will render it. If, for example, you get a character with code 26, you look it up in your encoding array (which should be a 256-element array for Type 1 fonts) and, with the differences applied, you get the name /circumflex. You then look that up in the CharStrings dictionary, pull out the glyph data and render it. Any character that does not exist in the encoding should be set to /.notdef, which will then render a shape representing an undefined character (usually an empty box).
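As a rough illustration of that lookup chain, here is a small Python sketch. Everything in it is a stand-in: glyph_for_code is a hypothetical helper, the encoding holds only two sample entries, and the charstrings dict contains placeholder byte strings instead of real Type 1 charstring data.

```python
def glyph_for_code(code, encoding, charstrings):
    """Map a character code to (glyph name, glyph data), using .notdef as fallback."""
    name = encoding[code] if 0 <= code < len(encoding) else ".notdef"
    if name not in charstrings:
        name = ".notdef"  # the font has no outline under that name
    return name, charstrings[name]


# Sample encoding with differences already applied (only two entries filled in).
encoding = [".notdef"] * 256
encoding[24] = "breve"
encoding[26] = "circumflex"

# Placeholder glyph data; a real /CharStrings dictionary holds charstring programs.
charstrings = {
    ".notdef": b"<empty box outline>",
    "circumflex": b"<circumflex outline>",
}

print(glyph_for_code(26, encoding, charstrings))  # found in both tables
print(glyph_for_code(65, encoding, charstrings))  # not in the encoding
print(glyph_for_code(24, encoding, charstrings))  # encoded, but no outline here
```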
Now, your real problem is likely: how do I turn these glyph names into something more useful, like, say, Unicode?
If you look in Annex D of the PDF specification, you'll see a set of tables that define the character sets for the standard Latin encodings. You would make a lookup table that maps Adobe standard glyph names to Unicode. Unfortunately, the tables in Annex D are incomplete. Fortunately, Adobe has a file that defines all of that for you here. There is a link in that file which is now dead, but most likely it was meant to go here.
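A minimal Python sketch of such a lookup table follows. The table is a small hand-typed subset covering only the names seen in this question; a complete one would normally be loaded from a published glyph list (presumably a file like the Adobe Glyph List) rather than written by hand. The names GLYPH_TO_UNICODE and decode_bytes are made up for this example, and mapping unknown names to U+FFFD is one possible choice, not a requirement.

```python
# Hand-typed subset of a glyph-name-to-Unicode table (illustration only).
GLYPH_TO_UNICODE = {
    "one": "\u0031",
    "quotesingle": "\u0027",
    "grave": "\u0060",
    "breve": "\u02d8",
    "caron": "\u02c7",
    "circumflex": "\u02c6",
    "dotaccent": "\u02d9",
    "hungarumlaut": "\u02dd",
    "ogonek": "\u02db",
    "ring": "\u02da",
    "tilde": "\u02dc",
    "bullet": "\u2022",
    "dagger": "\u2020",
    "daggerdbl": "\u2021",
    "ellipsis": "\u2026",
}


def decode_bytes(data, encoding, glyph_to_unicode=GLYPH_TO_UNICODE):
    """Decode a PDF string: byte -> glyph name (via encoding) -> Unicode."""
    # Names missing from the table become U+FFFD (replacement character).
    return "".join(glyph_to_unicode.get(encoding[b], "\ufffd") for b in data)


# Sample encoding (assumption): only the codes used below are filled in.
encoding = [".notdef"] * 256
encoding[0x31] = "one"
encoding[0x1A] = "circumflex"
encoding[0x80] = "bullet"

text = decode_bytes(b"1\x80\x1a", encoding)
print([f"U+{ord(ch):04X}" for ch in text])
```

This only works for simple fonts where each byte of the string is one character code, which is the case being discussed here.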