Decode CID font codes to equivalent ASCII characters

Views: 650 · Published: 2020/11/9 19:48:11 · Tags: python, fonts, pdfminer

Problem description

I'm trying to mine some text from a bunch of PDFs and a few of them have embedded CID fonts in the output:

(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:177)(cid:3)(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92)
(cid:3)(cid:56)(cid:49)(cid:3)(cid:43)(cid:68)(cid:69)(cid:76)(cid:87)(cid:68)(cid:87)
(cid:3)(cid:68)(cid:86)(cid:3)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:90)(cid:76)(cid:87)(cid:75)(cid:3)(cid:80)(cid:82)(cid:85)(cid:72)(cid:3)(cid:87)
(cid:75)(cid:68)(cid:81)(cid:3)(cid:20)(cid:19)(cid:3)

When I look at that exact snippet of text in the PDF, the letters are certainly convertible to ASCII:

This suggests that a brute-force decoding would work (i.e. read a snippet of text that corresponds to a run of CID codes and build a mapping from it), but will this be reliable across lots of different PDFs? Is there a reliable mapping from these CID codes to ASCII characters, or will it be highly dependent on the font in the PDF? How can I determine which ASCII character a CID code like (cid:72) corresponds to?
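To make the brute-force idea concrete, here is a minimal sketch of it in plain Python. It pairs each `(cid:N)` token with the character at the same position in a reference string typed by hand from the rendered page. The helper names `build_mapping` and `decode` are hypothetical, and the reference string `"metacities "` is my assumption about what the first line of the output above reads as in the PDF. The sketch does not answer whether such a mapping carries over to other fonts or other PDFs.

```python
import re

# Matches the "(cid:N)" tokens that appear in the extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def build_mapping(cid_text, known_text):
    """Pair each CID code with the character at the same position in known_text."""
    codes = [int(n) for n in CID_PATTERN.findall(cid_text)]
    if len(codes) != len(known_text):
        raise ValueError("CID count and reference text length differ")
    mapping = {}
    for code, char in zip(codes, known_text):
        # A code seen twice must always pair with the same character.
        if mapping.setdefault(code, char) != char:
            raise ValueError(f"cid:{code} maps to more than one character")
    return mapping


def decode(cid_text, mapping):
    """Replace known CID tokens; leave unknown ones untouched."""
    return CID_PATTERN.sub(
        lambda m: mapping.get(int(m.group(1)), m.group(0)), cid_text
    )


# First line of the output above; the reference text is an assumption.
sample = (
    "(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)"
    "(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)"
)
mapping = build_mapping(sample, "metacities ")
print(decode("(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)", mapping))
```

Codes that never appeared in the reference snippet, such as (cid:191), are left as-is by `decode`, so gaps in the mapping stay visible instead of being guessed.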

For what it's worth, I'm extracting the text using PDFMiner, which appears to be the only tool that actually reports the CID codes. If there is a better tool out there for converting PDFs to HTML or any other parsable text format, I'm open to other suggestions!

As an added bonus, this question appears to be related to a few other unanswered questions, so there is a rich bounty of reputation on the line here:

Font cannot be extracted by PDFMiner
What is this (cid:51) in the output of pdf2txt?
