What to do with CIDs in text extracted by PDFMiner?

Question

I have some PDFs that are in Hindi and contain extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like:

As one can see, a number of characters are converted into the form "(cid:number)".

On further analysis, I found out that a PDF contains CMAPs which map character codes to glyph indices. So, a CID is a character identity for the glyph it maps to, inside the CMAP table.

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

Moreover, according to a comment to this similar question, this process may not be legal. But I'm not trying to steal someone's font. I want the text. How does this process become illegal?

Since there are many questions like this one, I want to clarify that I do not aim at solving the "cid" problem. I want to clarify the reasons for the problem and the reasons for its illegality.

This issues page for pdfminer discusses the problem, and the author clearly says that there seems to be no reliable workaround. Is there some general, basic limitation (such as no access to the font) that makes this issue persist?

Recommended answer

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

The character codes one finds in the PDF content streams do not need to be related to Unicode values in any obvious way. In particular, a PDF viewer does not at all need a Unicode code point for a character code to show the matching glyph.

In a PDF a font has a mapping (or a sequence of mappings) from character code to glyph ID in the font program, and this mapping may be completely arbitrary.

For example, in the case of embedded font subsets, the subset font program is often created by giving the first glyph used on a page a starting glyph id n, the second distinct glyph on that page the id n+1, the next distinct glyph the id n+2, and so on. The character codes are then often identical to the glyph ids, i.e. the mapping above is the identity mapping. If there is no additional information, a text extractor has no chance of doing its job properly.
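To make the subsetting point concrete, here is a minimal toy sketch in plain Python. It does not read a real PDF; `build_subset` and `encode` are hypothetical helper names, and the assumption is simply that glyph ids are handed out in order of first use and that the character code equals the glyph id.

```python
def build_subset(text, start_gid=1):
    """Assign glyph ids in order of first appearance, as a subsetter might."""
    gid_for_char = {}
    for ch in text:
        if ch not in gid_for_char:
            gid_for_char[ch] = start_gid + len(gid_for_char)
    return gid_for_char


def encode(text, gid_for_char):
    """Identity mapping: the character code written out is just the glyph id."""
    return [gid_for_char[ch] for ch in text]


# Two different "pages", each with its own font subset.
codes_a = encode("banana", build_subset("banana"))
codes_b = encode("nab", build_subset("nab"))

print(codes_a)  # [1, 2, 3, 2, 3, 2]
print(codes_b)  # [1, 2, 3]
```

Code 1 stands for "b" in the first subset and for "n" in the second, so the codes alone say nothing about which character was meant. A viewer does not care, because it only needs the glyph outline for each code.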

I want to clarify the reasons for the problem

Regular text extraction usually has the following options to find the Unicode value for a character code:

  • A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor needs.

Beware, though: these ToUnicode maps can be incomplete and sometimes even contain intentionally incorrect mappings!

  • The PDF font encoding definition may contain the name of a pre-defined standard encoding (e.g. WinAnsiEncoding or GBpc-EUC-H) or a standardized character name (e.g. space, seven, or ntilde) for a given code. A text extractor merely needs to know the encoding represented by that encoding name or the character represented by that character name.

But the encoding may also be an identity mapping (Identity-H and Identity-V, with character code = glyph code), which gives nothing away, and a character name may also be non-standardized (e.g. g17).

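The PDF specification says: if these methods fail to produce a Unicode value, there is no way to determine what the character code represents, in which case a conforming reader may choose a character code of their choosing.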
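The lookup order described above can be sketched roughly as follows. This is a simplified illustration, not pdfminer's actual implementation: `code_to_unicode`, `WIN_ANSI_SAMPLE` and `GLYPH_NAMES` are hypothetical names, the two tables are tiny excerpts, and the "(cid:N)" fallback is just one choice an extractor can make when nothing else is known.

```python
# Tiny excerpts standing in for the full tables a real extractor would ship.
WIN_ANSI_SAMPLE = {0x20: " ", 0x37: "7", 0xF1: "ñ"}
GLYPH_NAMES = {"space": " ", "seven": "7", "ntilde": "ñ"}


def code_to_unicode(code, to_unicode=None, encoding_name=None, differences=None):
    """Resolve a character code to text, or fall back to a "(cid:N)" marker."""
    # 1. A ToUnicode map, if present and covering this code, answers directly.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # 2. A character name given for this code overrides the base encoding.
    if differences and code in differences:
        name = differences[code]
        if name in GLYPH_NAMES:
            return GLYPH_NAMES[name]
        return f"(cid:{code})"  # non-standardized name such as g17
    # 3. A pre-defined standard encoding identified by name.
    if encoding_name == "WinAnsiEncoding" and code in WIN_ANSI_SAMPLE:
        return WIN_ANSI_SAMPLE[code]
    # 4. Identity-H / Identity-V or anything unknown: nothing to go on.
    return f"(cid:{code})"


print(code_to_unicode(5, to_unicode={5: "क"}))                 # क
print(code_to_unicode(0xF1, encoding_name="WinAnsiEncoding"))  # ñ
print(code_to_unicode(17, differences={17: "g17"}))            # (cid:17)
print(code_to_unicode(17, encoding_name="Identity-H"))         # (cid:17)
```

An incomplete ToUnicode map combined with an identity encoding ends up in step 4 for every code the map does not cover, which is consistent with output where only some characters appear as "(cid:N)".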

In case of your text extraction output I would guess the PDF font has an incomplete ToUnicode map.
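One way to check that guess is to count which codes remain unresolved in the extracted text. The sketch below assumes the unresolved characters appear as "(cid:N)" tokens, as in the output described in the question; `unmapped_cids` is a hypothetical helper name and the sample string is made up for illustration.

```python
import re
from collections import Counter

# Matches tokens such as "(cid:12)" and captures the number.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def unmapped_cids(text):
    """Count how often each unresolved character code occurs in the text."""
    return Counter(int(number) for number in CID_TOKEN.findall(text))


sample = "क(cid:12)म (cid:7)ल(cid:12)"
print(unmapped_cids(sample).most_common())  # [(12, 2), (7, 1)]
```

If only a handful of distinct codes show up while the rest of the text extracts correctly, that pattern fits a ToUnicode map that simply lacks entries for those codes.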

Actually there are some more locations to look for additional information, e.g. the font program might include its own mapping of its glyphs to Unicode, but that additional information is also optional.

... and the reasons for its illegality.

In case of all the above options I don't see any sensible font license being violated, in particular as most of those options didn't even look into the font program (e.g. the *.ttf) itself, merely into the PDF metadata wrapping it.

On the other hand, if e.g. you had the idea to construct ToUnicode maps for those fonts missing such a map by drawing each glyph of the font onto a bitmap, nicely separated from anything else, and applying OCR to it, you as the recipient of the PDF suddenly would use the font program to draw something else than the original document, and this might be considered usage not covered by the license.
