Decode CID font codes to equivalent ASCII characters

Question

I'm trying to mine some text from a bunch of PDFs, and a few of them use embedded CID fonts that show up as raw CID codes in the output:

(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:177)(cid:3)(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92)
(cid:3)(cid:56)(cid:49)(cid:3)(cid:43)(cid:68)(cid:69)(cid:76)(cid:87)(cid:68)(cid:87)
(cid:3)(cid:68)(cid:86)(cid:3)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:90)(cid:76)(cid:87)(cid:75)(cid:3)(cid:80)(cid:82)(cid:85)(cid:72)(cid:3)(cid:87)
(cid:75)(cid:68)(cid:81)(cid:3)(cid:20)(cid:19)(cid:3)

When I look at that exact snippet of text in the PDF, the letters are certainly convertible to ASCII:

This probably suggests that a brute-force decoding would work (i.e. read a snippet of text that corresponds to a bunch of CID codes and build a mapping that way), but will this be reliable across lots of different PDFs? Is there a reliable mapping from these CID codes to ASCII characters, or will that be highly dependent on the font in the PDF? How can I determine which ASCII character a CID code like (cid:72) corresponds to?
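The brute-force idea described above can be sketched roughly as follows. This is only an illustration: the helper names (`parse_cids`, `learn_mapping`, `decode`) are made up for this example, and the known text "metacities " is an assumption about what the PDF shows for the first run of codes. The resulting mapping would only be valid for the one font it was learned from.

```python
import re

# Matches the "(cid:N)" tokens that appear in the extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def parse_cids(text):
    """Return the CID numbers found in '(cid:N)' style output."""
    return [int(n) for n in CID_TOKEN.findall(text)]


def learn_mapping(cid_text, known_text):
    """Pair each CID with the character seen at the same position."""
    cids = parse_cids(cid_text)
    if len(cids) != len(known_text):
        raise ValueError("CID count and text length differ; cannot align")
    mapping = {}
    for cid, ch in zip(cids, known_text):
        # A CID that maps to two different characters means the alignment is wrong.
        if mapping.setdefault(cid, ch) != ch:
            raise ValueError(f"CID {cid} maps to both {mapping[cid]!r} and {ch!r}")
    return mapping


def decode(cid_text, mapping, unknown="\ufffd"):
    """Decode CIDs through the learned mapping; unseen CIDs become U+FFFD."""
    return "".join(mapping.get(cid, unknown) for cid in parse_cids(cid_text))


# First run of codes from the output above, aligned with the ASSUMED visible text.
sample = ("(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)"
          "(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)")
mapping = learn_mapping(sample, "metacities ")

# CID 177 was never seen during learning, so it stays undecoded.
print(decode("(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)(cid:177)", mapping))
```

Any CID that never appeared in the aligned sample stays undecoded, which is one reason this approach is unlikely to generalize across fonts or PDFs.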

For what it's worth, I'm extracting the text using PDFminer, which appears to be the only tool that actually reports the CID codes. If there is a better tool out there for converting PDFs to HTML or any other parsable text format, I'm open to other suggestions!

As an added bonus, this question appears to be related to a few other unanswered questions, so there is a rich bounty of reputation on the line here:

  • Font cannot be extracted by PDFMiner
  • What is this (cid:51) in the output of pdf2txt?

Recommended answer

While you can probably do this by guesswork for the simple example here, to really do it correctly you'll need 2 additional pieces of information:

1) The Registry-Ordering-Supplement (ROS) information for the font in question. This will usually be something like 'Adobe-Japan1-5' or some such and is an informational property stored in the font. The ROS determines how the CIDs are to be interpreted. A given CID in one font is not necessarily the same as a CID in another font, unless the ROSes are the same. That is to say: CID12345 in Adobe-Japan1-5 is not the same shape as CID12345 in Adobe-GB1-3!

2) Armed with the ROS info, select a compatible CMap and decode through that. ASCII is a bit short-sighted; I would go with Unicode of which ASCII is a subset. You can find CMap files for the Adobe-defined ROSes at https://github.com/adobe-type-tools/cmap-resources
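The two steps above can be sketched as follows, under several assumptions. The function names are hypothetical. Treating two ROS strings as the same character collection when registry and ordering match (ignoring the supplement number) is an assumption of this sketch. `SAMPLE_CMAP` is a small illustrative fragment written in the `begincidrange` syntax used by CMap files; it is not copied from a real Adobe file. The sketch also assumes the source codes in the CMap are single 16-bit Unicode values, so it does not handle surrogate pairs or `begincidchar` sections.

```python
import re


def parse_ros(ros):
    """Split a string like 'Adobe-Japan1-5' into (registry, ordering, supplement)."""
    registry, ordering, supplement = ros.rsplit("-", 2)
    return registry, ordering, int(supplement)


def same_collection(ros_a, ros_b):
    """ASSUMPTION: CIDs are comparable when registry and ordering both match."""
    return parse_ros(ros_a)[:2] == parse_ros(ros_b)[:2]


RANGE_BLOCK = re.compile(r"begincidrange(.*?)endcidrange", re.S)
RANGE_LINE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(\d+)")


def cid_to_unicode(cmap_text):
    """Invert a Unicode-to-CID CMap into a {cid: character} table."""
    table = {}
    for block in RANGE_BLOCK.findall(cmap_text):
        for lo_hex, hi_hex, first_cid in RANGE_LINE.findall(block):
            lo, hi, first = int(lo_hex, 16), int(hi_hex, 16), int(first_cid)
            for offset in range(hi - lo + 1):
                # Keep the first character seen if several codes share a CID.
                table.setdefault(first + offset, chr(lo + offset))
    return table


# Illustrative data only, NOT an excerpt from a real CMap file.
SAMPLE_CMAP = """
2 begincidrange
<0020> <005b> 1
<005d> <007e> 62
endcidrange
"""

print(same_collection("Adobe-Japan1-5", "Adobe-GB1-3"))
table = cid_to_unicode(SAMPLE_CMAP)
print("".join(table.get(cid, "\ufffd") for cid in [41, 70, 77, 77, 80]))
```

In practice the CMap text would come from a file in the repository linked above that matches the font's ROS, rather than from an inline sample.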
