How are various glyphs encoded inside a PDF content stream?


Question


I am working on a program that outputs PDF documents. Given a sequence of UTF-8 encoded characters and the name of a font that shall be used to render it, I would like to show the appropriate glyphs that make the actual content of the document. I would like to be able to display national characters such as č or ö. It would be great to support ligatures like ae or ffi.

The problem is, I do not know how the actual glyphs to be shown are specified (inside a content stream, for example).

If, for example, I want to display the string "Hello World", I need not worry about encoding; I simply write (Hello World)Tj. The PDF reader will then use the appropriate font to render this string.

But what if I wanted to show the string "It is difficult to read the PDF specification all day. Prostě dočista nemožné!" with the ligatures ffi, fi and ea and the Czech national symbols ě, č and é in a given font? How would I proceed?

I am trying to get through the PDF specification, but it is not easy.

  • How do I find out the "code of the glyph" that corresponds to a given character or ligature?
  • How is this code encoded within a PDF content stream?

Help is much appreciated.


Edit: I may have overestimated the problem. Counting the glyphs needed to display a "common European document", I cannot see how this number could exceed 256. If my assumption is correct, I can remap the encoding of the font completely. This should be sufficient to cover the Latin alphabet, digits, punctuation and common symbols like ( and [, and still leave plenty of room for national symbols, ligatures and other elements of high-quality typography. (I can implement a priority queue to select the most used ligatures if the total number of glyphs exceeds 256.)

That being said, I do not think I need to use the CID-keyed fonts.

Still, I wonder how to map UTF-8 encoded characters onto the glyphs of an arbitrary font. I have the AFM of the font available. For the DejaVu font, for example, the character information looks like this:

C 63 ; WX 536 ; N question ; B 67 -15 488 743 ;
C 64 ; WX 1000 ; N at ; B 65 -174 930 705 ;
C 65 ; WX 722 ; N A ; B -6 0 732 730 ;

But after the 256th character is mapped, the codes are -1:

C 255 ; WX 564 ; N ydieresis ; B -3 -223 563 767 ;
C -1 ; WX 722 ; N Amacron ; B -6 0 732 899 ;
C -1 ; WX 596 ; N amacron ; B 49 -15 568 746 ;

For example, if I had the sequence 11100010 10000010 10101100 (Euro sign) in my input, how would I know what glyph name it corresponds to so that I can map it in the /Encoding dictionary?

Answer

Encoding varies based on the font type. Typically, there is a font resource that is set as the current font, and within that font dictionary is a reference to a base font and a means of describing the encoding (via the /Encoding key). If that key doesn't exist, the encoding will be "standard", but you can use other simple encodings such as /MacRomanEncoding and /WinAnsiEncoding as the value, or you can specify a base encoding plus an encoding delta (a /Differences array) that lists the differences.

Easy so far, as long as you're working with 8-bit characters. Many early apps would create a couple of different fonts, one with, say, Roman encoding and another that maps character codes to the otherwise unavailable characters. To do that, your encoding delta would include references to the ligatures and other typically non-encoded symbols. This works great for Type 1 fonts, but is specifically contraindicated by the spec in the section on TrueType fonts:

A nonsymbolic font should specify MacRomanEncoding or WinAnsiEncoding as the value of its Encoding entry, with no Differences array

This is vastly different when you want to use, say, Unicode. In that case you would be using a CID font (a font based on character IDs). The font references a procedure (a CMap) that is used to map from a character code in your string to a character ID in your font (and vice versa). I would strongly recommend that you read and fully understand section 9.7 of the PDF specification, on Composite Fonts, which describes everything you need in order to encode UTF-16BE into strings so that they render properly in PDF. It is decidedly non-trivial: there are a lot of details that, if missed, will result in a blank rendered page in Acrobat.

As a software engineer who professionally writes code that produces and consumes PDF, let me state that when I get tasked with putting special cases in my code to deal with non-spec-compliant PDF, a little piece of me dies inside. Please, please, don't even think of releasing any documents you produce into the wild until they pass Preflight at the least. This is not the same as "Acrobat renders it so it must be OK." Let me give you an example - I've seen a number of files in the wild that include fonts that are missing key elements of the FontDescriptor dictionary, including /Ascent, /Descent, /CapHeight, etc. These render in Acrobat, but are in violation of the spec since each of those is required. I know how Acrobat handles that - it comes with an enormous database of font metrics and looks up the value if it can't find it in the file (heck, it might even ignore the metrics in the file). I don't have that luxury, so I have to resort to a number of (potentially expensive/invalid) stopgap measures.

You might want to consider using a library to do this work for you - maybe iText, which has a decent enough licensing scheme for education because, I get it, you're a student. There are some C-based libraries too. Maybe you can figure out a way to make Ghostscript do your bidding.

If you are unwilling or unable to follow my advice about cleaving to the specification, or about using a library which ostensibly does so, please do me the favor of at least filling out the /Creator and /Producer strings in the Document Information Dictionary referenced by the trailer (see sections 14.3.3 and 7.5.5). That way, when I have to parse/consume/manipulate your documents, I will have a way to directly cast aspersions on your parentage.

Let's go top down and start with the page object - I'm using output from my own library and am stripping out what I think you don't need:

1 0 obj << 
    /Type /Page 
    /Parent 18 0 R 
    /Resources << 
       /Font << 
          /U0 13 0 R 
          >>
       /ProcSet [ /PDF /Text ]
       >>   
    /MediaBox [ 0 0 612 792 ]
    /Contents 19 0 R    
    /Dur -1 
    >>
 endobj

U0 is a reference to a font that will be used for Unicode text.

The content stream is intended to print the following text: Greek: Γειά σου κόσμος.

BT /U0 24 Tf 72 670 Td 
(\000G\000r\000e\000e\000k\000:\000 \003\223\003\265\003\271\003\254\000 \003\303\003\277\003\305\000 \003\272\003\314\003\303\003\274\003\277\003\302) 
Tj ET
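The string operand above is simply the text in UTF-16BE, written as a PDF literal string with non-printable bytes as octal escapes. A minimal Python sketch of that step follows. The function name pdf_literal_utf16be is made up for illustration, and it assumes the setup used in this answer (2-byte codes under Identity-H that equal the UTF-16BE code units), so it only makes sense for characters in the Basic Multilingual Plane.

```python
def pdf_literal_utf16be(text: str) -> str:
    """Encode text as UTF-16BE and wrap it as a PDF literal string.

    Hypothetical helper, not part of any library. Printable ASCII bytes are
    kept as-is; parentheses and backslashes are backslash-escaped; every
    other byte becomes a three-digit octal escape.
    """
    out = []
    for byte in text.encode("utf-16-be"):
        ch = chr(byte)
        if ch in "()\\":
            out.append("\\" + ch)          # delimiters must be escaped
        elif 0x20 <= byte <= 0x7E:
            out.append(ch)                 # printable ASCII stays readable
        else:
            out.append(f"\\{byte:03o}")    # e.g. 0x93 -> \223
    return "(" + "".join(out) + ")"


# Build the Tj operand for the sample text.
print(pdf_literal_utf16be("Greek: Γειά σου κόσμος"))
```

For the sample text this should reproduce the operand shown in the content stream above. A hex string (<0047 0072 ...>) would work just as well; the literal form is used here only to match the answer.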

The font dictionary referenced looks like this:

13 0 obj << 
    /BaseFont /DejaVuSansCondensed 
    /DescendantFonts [ 4 0 R  ]
    /ToUnicode 14 0 R 
    /Type /Font 
    /Subtype /Type0 
    /Encoding /Identity-H 
>>
endobj

The /ToUnicode entry points to a stream containing the following PostScript code:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfrange
<0000> <FFFF> <0000>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

This syntax is defined by the CID font specification.

The DescendantFonts array points to this object:

4 0 obj << 
    /Subtype /CIDFontType2 
    /Type /Font 
    /BaseFont /DejaVuSansCondensed 
    /CIDSystemInfo 7 0 R 
    /FontDescriptor 8 0 R 
    /DW 1000 
    /W 9 0 R 
    /CIDToGIDMap 10 0 R 
>>

The CIDToGIDMap is a compressed stream with the actual map, and the CIDSystemInfo is <</Registry (Adobe) /Ordering (UCS) /Supplement 0>> (it's a reference because I share it among all the Unicode fonts that I output). The FontDescriptor is straightforward boilerplate, and the W array is derived from the font metrics.

With all this detail, do you understand why I don't say lightly, "walk away before you pollute my environment any further"?

I'm really beginning to question the nature of this assignment. Writing a simple PDF is one thing, but writing code that can handle full Unicode in any arbitrary OpenType/TrueType font requires you to understand the CID spec and the TrueType spec (hint: I have a full TrueType parser that can extract all the metrics for any glyph in a font so that I can output the /W array).
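To make those two pieces concrete, here is a rough Python sketch of how the CIDToGIDMap stream data and the W array could be assembled. It assumes you already have a CID-to-glyph-index table and per-CID widths from parsing the font; the function names and all glyph indices and widths below are made-up illustrations, not values read from DejaVu. The map layout (two bytes per CID, high-order byte first, at offset 2 × CID) is what the PDF specification describes.

```python
import zlib


def build_cid_to_gid_map(cid_to_gid: dict[int, int], max_cid: int = 0xFFFF) -> bytes:
    """Return the uncompressed CIDToGIDMap data: 2 bytes per CID, big-endian."""
    table = bytearray(2 * (max_cid + 1))   # unmapped CIDs stay at glyph 0
    for cid, gid in cid_to_gid.items():
        if 0 <= cid <= max_cid:
            table[2 * cid:2 * cid + 2] = gid.to_bytes(2, "big")
    return bytes(table)


def build_w_array(widths: dict[int, int], default_width: int = 1000) -> str:
    """Format a /W array, grouping consecutive CIDs as `first [w1 w2 ...]`.

    CIDs whose width equals default_width (the /DW value) are omitted.
    """
    parts = []
    run_start, run, prev = None, [], None
    for cid in sorted(widths):
        w = widths[cid]
        if w == default_width:
            continue                        # covered by /DW
        if prev is not None and cid == prev + 1:
            run.append(w)                   # extend the current run
        else:
            if run:
                parts.append(f"{run_start} [{' '.join(map(str, run))}]")
            run_start, run = cid, [w]       # start a new run
        prev = cid
    if run:
        parts.append(f"{run_start} [{' '.join(map(str, run))}]")
    return "[ " + " ".join(parts) + " ]"


# Illustrative values only.
raw_map = build_cid_to_gid_map({0x41: 36, 0x393: 300})
stream_data = zlib.compress(raw_map)        # suitable for /Filter /FlateDecode
print(len(raw_map), len(stream_data))
print(build_w_array({65: 722, 66: 686, 67: 698, 70: 1000, 71: 775}))
```

The full 128 KB table compresses very well because most of it is zeros, which is presumably why the map is stored as a compressed stream.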

If, however, you are required to only output to Type 1 fonts, well my friend, your life got a whole lot easier. You would take your entire UTF-8 stream, read it as Unicode and, for every unique character that comes in, build a map from the Unicode character to a glyph name and an internal character number, using a glyph-name lookup table (for example, the Adobe Glyph List). The internal character number is essentially the unique index of the character, in order of arrival, mod 256. So, for example, if you have fewer than 257 unique characters on the page, you will have exactly one font that is encoded to map to the characters in the order that they arrived. If you had "abcba" for input, the output string in PDF would be (\000\001\002\001\000) and would map to a font with an encoding dictionary with a Differences array that would be [0/a/b/c]. If you have n unique characters where n > 256, you're going to need ⌈n / 256⌉ fonts, each with its own encoding.

If your teacher/professor wants anything but Type 1 fonts in a short period of time, s/he has unrealistic expectations for the students and/or low expectations for the quality of output. You should ask whether you are required to handle CID fonts, and if you are, then your professor is at the very least a sadist. It took me, a seasoned professional, about 4 days to write a TrueType parser for extracting widths. I had the advantage of (1) using a managed language (C#), which cut down on concerns that will be biting your ass in C, and which also let me use reflection to automate parsing, and (2) when I don't have interruptions, I write solid code about 10-20 times faster than a typical student. So my 32 hours would translate into 320 student hours, more or less (then again, my code has different constraints than yours - it has to consume any crap font it gets gracefully), so let's call it 200 or less if you're allowed to steal something like stb. That's just for getting one particular element in the font descriptor.
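The Type 1 scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: GLYPH_NAMES is a tiny hand-written excerpt standing in for a real Unicode-to-glyph-name table, and the function names are invented here. It also shows that the byte sequence 11100010 10000010 10101100 from the question decodes to U+20AC, whose glyph name is Euro.

```python
# Hypothetical excerpt of a Unicode -> glyph-name table; a real program
# would load a complete glyph list instead.
GLYPH_NAMES = {
    "a": "a", "b": "b", "c": "c",
    "\u010d": "ccaron",
    "\u00f6": "odieresis",
    "\u20ac": "Euro",
}


def assign_codes(text: str):
    """Number characters in order of first appearance.

    Returns (fonts, placed): fonts[k] is the list of glyph names for the
    k-th 256-slot font, and placed holds one (font_index, code) pair per
    input character.
    """
    index = {}
    for ch in text:
        index.setdefault(ch, len(index))
    font_count = (len(index) + 255) // 256          # ceil(n / 256)
    fonts = [[] for _ in range(font_count)]
    for ch, i in index.items():                     # insertion order is kept
        fonts[i // 256].append(GLYPH_NAMES[ch])
    placed = [divmod(index[ch], 256) for ch in text]
    return fonts, placed


def differences_array(names) -> str:
    """Format a /Differences array that starts at code 0."""
    return "[0 " + " ".join("/" + n for n in names) + "]"


def literal_string(codes) -> str:
    """Write single-byte codes as a PDF literal string of octal escapes."""
    return "(" + "".join(f"\\{c:03o}" for c in codes) + ")"


fonts, placed = assign_codes("abcba")
print(differences_array(fonts[0]))
print(literal_string(code for _, code in placed))

# The Euro sign from the question: three UTF-8 bytes, one code point.
euro = bytes([0b11100010, 0b10000010, 0b10101100]).decode("utf-8")
print(hex(ord(euro)), GLYPH_NAMES[euro])
```

With more than 256 unique characters, the text would additionally have to be split into runs that share a font index, with a Tf operator switching fonts between runs; that part is left out of the sketch.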
