CID fonts and mapping


Question


OK, I've done some research on the subject, but as the title indicates I'm no expert. So here's the problem: I'm extracting some text from PDFs using Python and the pdfminer library.

I've only tried documents with Latin characters, and it works well in most cases, except when the font is not Latin/Western. The document that bugs me now uses Latin characters from a Japanese font. Adobe tells me the encoding is Adobe-Identity. All I get is the CID of each character, and I can't find the related CID map.

I know I'm not using the right terms. What I mean is that the PDF tells me cid=3 and I know the character is a space. I've manually written a map for the characters in the range 0x00-0xFF. Some sources say it matches the "mac-roman" encoding, others disagree. Other sources say it matches the OpenType mapping, but I couldn't find anything beyond 0xFF. And I've got CIDs > 3000.

You can tell I'm very confused, so you're welcome to correct my terminology, but what I want is a map that matches my own, extended to the range 0x0100-0xFFFF.

ETA: the link to the PDF that bugs me: http://www.sas.upenn.edu/~jtigay/JapanVol.pdf
ETA2: I found this archive: ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/aj14.tar.Z. The cid2code.txt within it is the kind of map I'm looking for. But for all those fonts the CID column seems "shifted" by two: there, CID 1 maps to space.
ETA3: corrected the encoding
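
For readers who want to experiment with a table like cid2code.txt, here is a minimal sketch of loading one column of it into a CID-to-text dictionary. It assumes a tab-separated layout with "#" comment lines, a header row naming the columns, "*" for "no mapping", comma-separated alternatives, and an optional trailing "v" on vertical variants; those details, the column name and the sample rows are assumptions for illustration, not taken from the archive itself. The function name parse_cid2code is hypothetical.

```python
def parse_cid2code(lines, column):
    """Build {cid: text} from cid2code-style tab-separated lines.

    Assumed layout: '#' comments, one header row, hex code points,
    '*' meaning "no mapping" in that column.
    """
    mapping = {}
    col = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if col is None:
            col = fields.index(column)  # first data line is the header
            continue
        value = fields[col]
        if value == "*":
            continue  # this CID has no entry in the chosen column
        # Keep only the first alternative and drop a vertical-form marker.
        first = value.split(",")[0].rstrip("v")
        mapping[int(fields[0])] = chr(int(first, 16))
    return mapping


# Made-up sample rows in the assumed layout.
sample = [
    "# illustrative data only\n",
    "CID\tUniJIS-UCS2-H\n",
    "0\t*\n",
    "1\t0020\n",
    "2\t0021\n",
]
cid_to_text = parse_cid2code(sample, "UniJIS-UCS2-H")
```

Note that a table like this only applies to fonts whose CIDs actually follow the corresponding character collection.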

Answer

You might be looking for the tables provided in the Adobe Developer Support Technical Note #5078,

Adobe-Japan1-6 Character Collection for CID-Keyed Fonts,

in combination with the background knowledge provided by Technical Note #5014,

Adobe CMap and CIDFont Files Specification.

Unfortunately, you have not provided the document that bugs you, so I cannot check whether these references really are appropriate.

EDIT

Since you corrected your question and are now asking about the special-purpose Adobe-Identity-0 ROS ("ROS" is an abbreviation for /Registry, /Ordering, and /Supplement, the three /CIDSystemInfo dictionary elements present in CIDFont and CMap resources) instead of Adobe-Japan1-?, the references above are no longer of interest to you. Unfortunately, though, Adobe-Identity seems to be the ROS of choice whenever none of the public ROSes is applicable. Thus, there seems to be no generic answer to your question about a CID-to-Unicode map.
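
As a rough illustration of that distinction, the sketch below decides from the /Registry and /Ordering values whether a published character-collection table could apply at all. The function names and the set of orderings are assumptions chosen for illustration (the set lists well-known Adobe CJK collections and is not meant to be exhaustive); how you read /CIDSystemInfo out of the PDF depends on your library and is not shown.

```python
# Well-known public Adobe character collections (illustrative, not exhaustive).
PUBLIC_ORDERINGS = {"Japan1", "GB1", "CNS1", "Korea1"}


def ros_name(registry, ordering, supplement):
    """Format a ROS the usual way, e.g. 'Adobe-Identity-0'."""
    return f"{registry}-{ordering}-{supplement}"


def has_public_cid_table(registry, ordering):
    """True if CIDs can be looked up in a published collection table.

    For Adobe-Identity the CIDs are font-specific, so no generic
    CID-to-Unicode table exists.
    """
    return registry == "Adobe" and ordering in PUBLIC_ORDERINGS


font_ros = ("Adobe", "Identity", 0)  # values reported for the fonts in question
name = ros_name(*font_ros)
lookup_possible = has_public_cid_table(font_ros[0], font_ros[1])
```

For the fonts discussed here, lookup_possible comes out False, which is why the text has to be recovered from the font's own /ToUnicode map, if it has a usable one.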

Furthermore, the /ToUnicode maps of the Times fonts in your PDF all look like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

(Interestingly, the CIDSystemInfo here differs from the one in the PDF font object itself, which is Adobe-Identity-0.)

According to the PDF specification ISO 32000-1:2008 section 9.10.3, though,

it shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Thus, no usable mapping is defined, which, according to the same spec and in combination with other aspects of those fonts, implies that

there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
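
To make this concrete, here is a minimal sketch of collecting the bfchar and bfrange entries from a /ToUnicode CMap given as text. It is not how pdfminer does it; parse_tounicode and its helpers are hypothetical names, the input is assumed to be the already-decoded stream, and well-formed hex tokens with an even number of bytes are assumed. Applied to the CMap shown above, it finds no entries at all.

```python
import re


def _code(token):
    """Integer value of a <hex> source code token."""
    return int(token.strip("<>"), 16)


def _text(token):
    """Decode a <hex> destination token as UTF-16BE text."""
    return bytes.fromhex(token.strip("<>")).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Return {character code: text} from bfchar/bfrange sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        tokens = re.findall(r"<[0-9A-Fa-f]+>", block)
        for src, dst in zip(tokens[0::2], tokens[1::2]):
            mapping[_code(src)] = _text(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        tokens = re.findall(r"<[0-9A-Fa-f]+>|\[|\]", block)
        i = 0
        while i + 2 < len(tokens):
            lo, hi = _code(tokens[i]), _code(tokens[i + 1])
            if tokens[i + 2] == "[":
                # Array form: one destination string per code in the range.
                end = tokens.index("]", i + 2)
                targets = [_text(t) for t in tokens[i + 3:end]]
                i = end + 1
            else:
                # Simple form: increment the last character of the base string.
                base = _text(tokens[i + 2])
                targets = [base[:-1] + chr(ord(base[-1]) + n)
                           for n in range(hi - lo + 1)]
                i += 3
            for code, text in zip(range(lo, hi + 1), targets):
                mapping[code] = text
    return mapping


# The body of the CMap above: only a codespace range, no bf* entries.
identity_ucs = "1 begincodespacerange\n<0000><FFFF>\nendcodespacerange\n"
empty = parse_tounicode(identity_ucs)

# A made-up CMap fragment that does define mappings, for comparison.
with_entries = ("1 beginbfchar\n<0003> <0020>\nendbfchar\n"
                "1 beginbfrange\n<0024> <0026> <0041>\nendbfrange\n")
filled = parse_tounicode(with_entries)
```

An empty result like the first one means the extractor has nothing to translate character codes with, which matches the behaviour described in the question.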
