PDF text extraction returns wrong characters due to ToUnicode map


Problem description

I am trying to extract text from a foreign-language PDF file using PDFMiner, but I am being foiled by a ToUnicode map. The file behaves strangely even in ordinary PDF viewers.

For example, here is a screenshot of some text in the file:

But if I select and copy the text, it looks like this:

िनरकर

You can see that several characters have changed, in particular the second-to-last character.

Not surprisingly, PDFMiner extracts the incorrect text, yet every PDF viewer manages to display the text correctly. I suspect the issue is either the ToUnicode map or something to do with conjunct characters. The desired letter should be the sequence 0x915, 0x94D, 0x937. PDFMiner reports only 0x915, which is a different character.
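To make the difference between the two code point sequences concrete, the following minimal sketch prints the Unicode names of the desired and the reported characters using only Python's standard library. The variable and function names (`desired`, `reported`, `describe`) are illustrative and not part of PDFMiner.

```python
import unicodedata

# The sequence the question expects for the conjunct letter,
# and the single code point PDFMiner actually reports.
desired = "\u0915\u094d\u0937"
reported = "\u0915"


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    print("desired: ", describe(desired))
    print("reported:", describe(reported))
```

The desired letter is a consonant, a virama, and a second consonant, so dropping the last two code points yields a different letter rather than a damaged version of the same one.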

What do I need to do to get PDFMiner to extract the text correctly, i.e. as in the image rather than as in the copy-pasted text?

Here is a link to the PDF in question.

Recommended answer

In short:

Your PDF does not contain the information required for correct text extraction without the use of OCR.

In detail:

Both the ToUnicode map and the Unicode entries in the font program of the embedded subset of Mangal-Regular in your PDF claim that these four glyphs

all represent the same Unicode code point, 0x915.

Thus, any text extraction program that does not look at the drawn glyph (i.e. does not attempt OCR) will return 0x915 for any one of those glyphs.
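The kind of collision described above can be spotted mechanically. The sketch below parses the `beginbfchar` ... `endbfchar` sections of a ToUnicode CMap and lists every Unicode string that more than one glyph code maps to. The CMap excerpt and its glyph codes are invented for illustration and are not taken from the PDF in question, and the helper names (`parse_bfchar`, `find_collisions`) are hypothetical. For brevity the sketch ignores `bfrange` sections, which a real CMap may also contain.

```python
import re
from collections import defaultdict

# Hypothetical ToUnicode CMap excerpt; the glyph codes are made up.
SAMPLE_CMAP = """
4 beginbfchar
<0010> <0915>
<0011> <0915>
<0012> <0915>
<0013> <0915>
endbfchar
2 beginbfchar
<0020> <0928>
<0021> <0930>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Map each glyph code to the Unicode string its bfchar entry declares."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # Destination values are UTF-16BE encoded hex strings.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def find_collisions(mapping):
    """Return the Unicode strings shared by more than one glyph code."""
    by_text = defaultdict(list)
    for code, text in mapping.items():
        by_text[text].append(code)
    return {t: sorted(c) for t, c in by_text.items() if len(c) > 1}


if __name__ == "__main__":
    for text, codes in find_collisions(parse_bfchar(SAMPLE_CMAP)).items():
        print(f"U+{ord(text[0]):04X} is claimed by glyph codes", codes)
```

Several glyphs sharing one code point is not always an error (alternate shapes of the same letter are legitimate), so a report like this only flags candidates to inspect by eye.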

Background:

You seem to wonder why PDF viewers display the text correctly while text extraction (copy and paste, or PDFMiner) does not return the correct text.

The reason is that PDF as a format does not contain the text as such. It contains pointers (either direct or via mappings) to glyph drawing instructions in embedded font programs. Using these pointers, the PDF is drawn as you expect.

Furthermore, it can contain extra information that maps such glyph pointers to Unicode code points. Text extraction programs rely on this extra information. In your PDF these mappings are incorrect and, therefore, the extracted text is incorrect.
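As a rough illustration of why the mapping alone decides the extracted text, the sketch below treats extraction as a plain table lookup from glyph codes to Unicode strings. The glyph codes and both tables are hypothetical. The second table only shows what an accurate mapping could look like (a ToUnicode entry may map one glyph to several code points) and is not a reconstruction of the font in the PDF.

```python
def extract(glyph_codes, to_unicode):
    """Look each glyph code up in the mapping; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# Hypothetical glyph codes: 0x10 draws the plain letter,
# 0x11 draws the conjunct form.
broken_map = {
    0x10: "\u0915",
    0x11: "\u0915",  # conjunct glyph wrongly declared as the plain letter
}
accurate_map = {
    0x10: "\u0915",
    0x11: "\u0915\u094d\u0937",  # one glyph mapped to three code points
}

if __name__ == "__main__":
    codes = [0x10, 0x11]
    print(extract(codes, broken_map))
    print(extract(codes, accurate_map))
```

A viewer never consults such a table when drawing, which is why the page renders correctly while the copied or extracted text does not.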
