What to do with CIDs in text extracted by PDFMiner?

Question

I have some PDFs in Hindi that contain extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like:

As one can see, a number of characters are converted into the form "(cid:number)".
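To get a feel for how many characters are affected, the placeholders can be located with a regular expression. This is only a minimal sketch: the sample string is made up for illustration (it is not taken from the PDFs in question), and the helper name find_cids is hypothetical.

```python
import re

# Matches the placeholder form quoted in the question, e.g. "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the numeric IDs of all (cid:N) placeholders, in order of appearance."""
    return [int(n) for n in CID_PATTERN.findall(text)]


# Hypothetical sample line mixing real Devanagari letters and placeholders.
sample = "\u0915(cid:12)\u0924 (cid:7)(cid:12)"
print(find_cids(sample))
```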

On further analysis, I found out that a PDF contains CMaps which map character codes to glyph indices. So a CID is a character identifier for the glyph it maps to inside the CMap table.

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

Moreover, according to a comment on this similar question, this process may not be legal. But I'm not trying to steal someone's font; I want the text. How does this process become illegal?

Since there are many questions like this one, I want to clarify that I do not aim at solving the "cid" problem. I want to clarify the reasons for the problem and the reasons for its illegality.

Edit: This issues page for pdfminer discusses the problem, and the author clearly says that there seems to be no reliable workaround. Is there some general, basic limitation (such as no access to the font) that makes this issue persist?

Recommended answer

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

The character codes one finds in PDF content streams do not need to be related to Unicode values in any obvious way. In particular, a PDF viewer does not need a Unicode code point for a character code at all in order to show the matching glyph.

In a PDF, a font has a mapping (or a sequence of mappings) from character code to glyph ID in the font program, and this mapping may be completely arbitrary.

E.g., in the case of embedded font subsets, the subset font program is often created by giving the first glyph used on a page a starting glyph ID n, then giving the second, different glyph on that page ID n+1, the next different glyph ID n+2, and so on. The character codes are then often identical to the glyph IDs, i.e. the mapping above is the identity mapping. If no additional information is available, a text extractor has no chance to do its job properly.
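The following toy sketch illustrates why such codes carry no Unicode information. It assumes, for simplicity, one glyph per character and a starting ID of 1; real subsetting tools differ in detail, and the function name build_subset is hypothetical.

```python
def build_subset(text, start_gid=1):
    """Assign glyph IDs in order of first use, as a subsetting tool might."""
    gid_for_char = {}
    codes = []
    for ch in text:
        if ch not in gid_for_char:
            # Each new, different glyph gets the next free ID: n, n+1, n+2, ...
            gid_for_char[ch] = start_gid + len(gid_for_char)
        # With an identity mapping, the character code equals the glyph ID.
        codes.append(gid_for_char[ch])
    return codes, gid_for_char


# Two unrelated texts: three Devanagari letters and three Latin letters.
codes_a, _ = build_subset("\u0928\u092e\u0928")
codes_b, _ = build_subset("aba")
print(codes_a)
print(codes_b)
```

Both texts end up with the same code sequence, so the codes alone cannot tell an extractor which characters were meant.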

I want to clarify the reasons for the problem

Regular text extraction usually has the following options to find the Unicode value for a character code:

  • A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor needs.

Beware, though: these ToUnicode maps can be incomplete and sometimes even contain intentionally incorrect mappings!

  • The PDF font encoding definition may contain the name of a pre-defined standard encoding (e.g. WinAnsiEncoding or GBpc-EUC-H) or a standardized character name (e.g. space, seven, or ntilde) for a given code. A text extractor merely needs to know the encoding represented by that encoding name or the character represented by that character name.

But the encoding may also be an identity (Identity-H and Identity-V, with character code = glyph code), which doesn't give away anything, and a character name may also be non-standardized (e.g. g17).
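As a rough sketch of the character-name option: standardized names can be looked up in a glyph list, while a name like g17 yields nothing. The table below is only a tiny hand-picked excerpt for illustration (a real extractor would load a complete glyph list), and the function name is hypothetical.

```python
# Tiny, hand-picked excerpt of standardized glyph names; a real extractor
# would load a full glyph list instead.
STANDARD_GLYPH_NAMES = {
    "space": "\u0020",
    "seven": "\u0037",
    "ntilde": "\u00f1",
}


def glyph_name_to_unicode(name):
    """Return the Unicode string for a standardized glyph name, or None."""
    return STANDARD_GLYPH_NAMES.get(name)


for name in ("space", "seven", "ntilde", "g17"):
    # "g17" is not a standardized name, so the lookup returns None for it.
    print(name, repr(glyph_name_to_unicode(name)))
```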

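The PDF specification says: if these methods fail to produce a Unicode value, there is no way to determine what the character code represents, in which case a conforming reader may choose a character code of its choosing.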

In the case of your text extraction output, I would guess the PDF font has an incomplete ToUnicode map.
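To illustrate how an incomplete ToUnicode map can lead to output like yours, here is a minimal sketch. It makes several assumptions: the CMap excerpt is made up, only bfchar sections are handled (bfrange sections are ignored), the character code is treated as the number shown in the placeholder, and the helper names are hypothetical. It is not how pdfminer itself is implemented.

```python
import re

# Hypothetical excerpt of a ToUnicode CMap; it has no entry for code 0x0003.
TOUNICODE_EXCERPT = """
2 beginbfchar
<0001> <0928>
<0002> <092E>
endbfchar
"""

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from bfchar sections only."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(section):
            # Destination values are hex-encoded UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes, mapping):
    """Map codes to text, using a (cid:N) placeholder for unmapped codes."""
    return "".join(mapping.get(c, "(cid:%d)" % c) for c in codes)


mapping = parse_bfchar(TOUNICODE_EXCERPT)
print(decode([1, 2, 3, 1], mapping))
```

Codes 1 and 2 come out as Devanagari letters, while code 3, which the map does not cover, comes out as a placeholder.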

Actually, there are some more places to look for additional information, e.g. the font program might include its own mapping of its glyphs to Unicode, but that additional information is optional, too.

... and reasons for its illegality.

For all of the above options I don't see any sensible font license being violated, in particular because most of those options do not even look into the font program itself (e.g. the *.ttf), merely into the PDF metadata wrapping it.

On the other hand, suppose you had the idea to construct ToUnicode maps for fonts missing such a map by drawing each glyph of the font onto a bitmap, nicely separated from anything else, and applying OCR to it. You, as the recipient of the PDF, would then suddenly be using the font program to draw something other than the original document, and this might be considered usage not covered by the license.
