/Differences dictionary for encoding parsing issue in PDF

Problem description

A Type 1 font's /Differences encoding maps character codes to glyph names written as strings; for example, the character 1 is encoded as 'one'. It is used only for numbers and special characters.

What is the standard way to use these encodings?

How should I decode a string from a PDF that uses such an encoding?

Link for the file: http://www.filedropper.com/open

Solution

Here's the /Differences array in your file (and honestly, you should have just posted this and not a link to a skeevy download page):

/Differences [
    24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
    39 /quotesingle
    96 /grave
    128 /bullet/dagger/daggerdbl/ellipsis...
]

The way this works is that the font also has an encoding associated with it (for example /MacRomanEncoding or /WinAnsiEncoding). In the case of a Type 1 font, there is an encoding built into the font. Given a copy of that base encoding, you apply the differences to it: each number gives a starting code, and the names that follow it replace consecutive entries from that code onward. Your first number is 24, so you change entries 24-31 inclusive to /breve, /caron, /circumflex and so on.
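The procedure just described can be sketched in Python roughly as follows. This is a minimal illustration, not a parser for real PDF objects: it assumes the /Differences array has already been read into a Python list of integers and glyph-name strings, and the function name `apply_differences` and the all-`.notdef` base encoding are placeholders chosen for the example. The array is truncated at the same point as the listing above.

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding with a /Differences array applied.

    base_encoding: list of 256 glyph names (index = character code).
    differences:   flat list mixing ints (starting codes) and glyph names.
    """
    encoding = list(base_encoding)  # work on a copy, keep the base intact
    code = None
    for item in differences:
        if isinstance(item, int):
            # A number resets the position at which names are written.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            if not 0 <= code <= 255:
                raise ValueError(f"character code out of range: {code}")
            encoding[code] = item
            code += 1  # following names go to consecutive codes
    return encoding


# Placeholder base encoding (assumption): a real one would come from the
# font's built-in encoding or a named encoding such as /WinAnsiEncoding.
base = [".notdef"] * 256

# The array from the question, cut off where the listing above is cut off.
differences = [
    24, "breve", "caron", "circumflex", "dotaccent",
        "hungarumlaut", "ogonek", "ring", "tilde",
    39, "quotesingle",
    96, "grave",
    128, "bullet", "dagger", "daggerdbl", "ellipsis",
]

encoding = apply_differences(base, differences)
print(encoding[24:32])
```

Copying the base encoding first means the same base can be reused for other fonts that carry their own /Differences arrays.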

In Type 1 fonts, there is a dictionary called /CharStrings, which is an association of the name of a glyph with the data/code that will render it. If, for example, you get a character with code 26, you look it up in your encoding array (which should be a 256-element array for Type 1 fonts) and, with the differences applied, you get the name /circumflex. You then look that up in the CharStrings dictionary, pull out the glyph data and render it. Any character that does not exist in the encoding should be set to /.notdef, which will then render a shape representing an undefined character (usually an empty box).
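As a rough sketch of that lookup chain (code, then glyph name, then glyph data), the following Python models /CharStrings as a plain dictionary. The byte strings stand in for real charstring programs and are invented for illustration, as is the helper name `glyph_for_code`; a real implementation would read both structures out of the font.

```python
NOTDEF = ".notdef"


def glyph_for_code(code, encoding, char_strings):
    """Map a character code to (glyph name, glyph data).

    Falls back to .notdef when the code is out of range or the
    encoding names a glyph the font does not contain.
    """
    name = encoding[code] if 0 <= code < len(encoding) else NOTDEF
    if name not in char_strings:
        name = NOTDEF
    return name, char_strings[name]


# Encoding with the differences already applied (only one entry filled
# in here to keep the example short).
encoding = [NOTDEF] * 256
encoding[26] = "circumflex"

# Stand-in for the font's /CharStrings dictionary (placeholder data).
char_strings = {
    NOTDEF: b"<notdef outline>",
    "circumflex": b"<circumflex outline>",
}

print(glyph_for_code(26, encoding, char_strings))
print(glyph_for_code(65, encoding, char_strings))
```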

Now likely your problem is: how do I turn these glyph names into something that is more useful, like, say, Unicode?

If you look in Annex D of the PDF specification, you'll see a set of tables that define the character sets for the standard Latin encodings. You would make a lookup table that maps Adobe standard glyph names to Unicode. Unfortunately, the tables in Annex D are incomplete. Fortunately, Adobe has a file that defines all of that for you here. There is a link in that file which is now dead, but most likely it was meant to go here.
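To make the final step concrete, here is a small Python sketch of decoding a PDF string through the encoding and a name-to-Unicode table. The table below is hand-written and covers only the glyph names that appear in this question; a complete table would be loaded from Adobe's file rather than typed in. The function name `decode_pdf_string` and the choice of U+FFFD for unknown names are assumptions made for the example.

```python
# Partial glyph-name-to-Unicode table (only names seen in this question).
GLYPH_TO_UNICODE = {
    "one": "\u0031",
    "quotesingle": "\u0027",
    "grave": "\u0060",
    "breve": "\u02d8",
    "caron": "\u02c7",
    "circumflex": "\u02c6",
    "dotaccent": "\u02d9",
    "hungarumlaut": "\u02dd",
    "ogonek": "\u02db",
    "ring": "\u02da",
    "tilde": "\u02dc",
    "bullet": "\u2022",
    "dagger": "\u2020",
    "daggerdbl": "\u2021",
    "ellipsis": "\u2026",
}

REPLACEMENT = "\ufffd"  # used when a glyph name has no known mapping


def decode_pdf_string(data, encoding):
    """Decode single-byte PDF string data via a 256-entry glyph-name encoding."""
    chars = []
    for code in data:  # iterating over bytes yields ints 0..255
        name = encoding[code]
        chars.append(GLYPH_TO_UNICODE.get(name, REPLACEMENT))
    return "".join(chars)


# Encoding with differences applied; only the entries used below are set.
encoding = [".notdef"] * 256
encoding[49] = "one"        # code 49 is the digit 1 in ASCII-based encodings
encoding[128] = "bullet"
encoding[131] = "ellipsis"

text = decode_pdf_string(b"\x80\x31\x83", encoding)
print(text)
```

Codes whose glyph name is missing from the table come out as U+FFFD, which makes gaps in the table easy to spot in the decoded text.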
