Parsing a pdf (Devanagari script) using PDFminer gives incorrect output


Problem description

I am trying to parse a pdf file containing Indian voters list which is in hindi(Devanagari script).

PDF displays all the text correctly but when I tried dumping this pdf into text format using PDFminer it output the characters which are different from the original pdf characters

For example Displayed/Correct word is सामान्य

But the output word is सपमपनसपमप

Now I want to know why this is happening and how do I correctly parse this type of pdf file

I am also including the sample pdf file-

http://164.100.180.82/Rollpdf/AC276/S24A276P001.pdf

Answer

This issue is very similar to the one discussed in this answer, and the appearance of the sample document there is also reminiscent of the document here.

Just like in the case of the document in that other question, the ToUnicode map of the Devanagari script font used in the document here maps multiple completely different glyphs to identical Unicode code points. Thus, text extraction based on this mapping is bound to fail, and most text extractors rely on this information, especially in the absence of a font Encoding entry, as is the case here.
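To see why such a many-to-one ToUnicode map makes extraction fail, consider a toy version of the problem (the glyph IDs and the target character below are made up purely for illustration):

```python
# Toy ToUnicode map: several distinct glyph IDs (CIDs) in the font
# all map to the SAME Unicode code point, as observed in this PDF.
# The CID values and the target character are illustrative only.
toy_tounicode = {
    12: "\u092a",  # glyph for one shape          -> प
    47: "\u092a",  # glyph for a different shape  -> also प
    63: "\u092a",  # yet another distinct glyph   -> also प
}

# Extraction applies the map forward: three different glyphs
# collapse into the same character ...
extracted = "".join(toy_tounicode[cid] for cid in (12, 47, 63))

# ... so the map cannot be inverted: given an extracted character,
# there is more than one glyph (and hence more than one real
# character) it could have come from. The information is lost.
candidates = [cid for cid, ch in toy_tounicode.items() if ch == "\u092a"]
```

This is exactly why the extracted text shows repeated wrong characters such as सपमपनसपमप where the page displays सामान्य.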

Some text extractors can instead use the glyph-to-Unicode mapping contained in the embedded font program (if present). But checking this mapping in the Devanagari script font program used in the document here, it turns out that it associates most glyphs with the code points U+F020 through U+F062, named "uniF020" etc.

These Unicode code points are located in the Unicode Private Use Area, i.e. they do not have a standardized meaning but applications may use them as they like.
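A quick way to check whether an extractor has returned these font-program code points is to test for the Basic Multilingual Plane Private Use Area, U+E000 through U+F8FF. This small helper is not part of pdfminer; it is plain Python:

```python
def is_private_use(ch: str) -> bool:
    """True if the character lies in the BMP Private Use Area."""
    return 0xE000 <= ord(ch) <= 0xF8FF

def pua_ratio(text: str) -> float:
    """Fraction of characters in `text` that are private-use code points."""
    if not text:
        return 0.0
    return sum(is_private_use(c) for c in text) / len(text)

# A run of U+F020..U+F062 characters is entirely private-use ...
print(pua_ratio("\uf020\uf030\uf062"))  # 1.0
# ... while real Devanagari text (block U+0900..U+097F) is not.
print(pua_ratio("सामान्य"))  # 0.0
```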

Thus, text extractors using the Unicode mapping contained in the font program wouldn't deliver immediately intelligible text either.

There is one fact, though, which can help you largely automate text extraction from this document nonetheless: the same PDF object is referenced for the Devanagari script font on multiple pages, so on all pages referencing the same PDF object, the same original character identifier (or the same private-use Unicode code point from the font program) refers to the same visual symbol. In the case of your document I counted only 5 copies of the font.

Thus, if you find a text extractor which either returns the character identifiers (ignoring all ToUnicode maps) or returns the private use area Unicode code points from the font program, you can use its output and merely replace each entry according to a few maps.
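The replacement step itself is simple once the maps have been built by hand, by comparing the rendered PDF with the raw extractor output, one map per copy of the font. The entries below are invented placeholders; the real maps have to be derived from the five fonts in this document:

```python
# Hypothetical map from private-use code points (as returned by an
# extractor that uses the embedded font program) to the correct text.
# One code point may stand for a whole conjunct, so values are
# strings, not necessarily single characters. All entries here are
# placeholders for illustration, not taken from the actual font.
pua_map = {
    0xF020: "सा",   # placeholder entry
    0xF021: "मा",   # placeholder entry
    0xF022: "न्य",  # placeholder entry
}

def fix_extracted(text: str, mapping: dict) -> str:
    """Replace every mapped code point; pass other characters through."""
    return "".join(mapping.get(ord(ch), ch) for ch in text)

raw = "\uf020\uf021\uf022"          # what such an extractor might return
print(fix_extracted(raw, pua_map))  # सामान्य
```

Note that Python's built-in `str.translate` accepts exactly this kind of ordinal-to-string dictionary, so `fix_extracted(text, mapping)` could also be written as `text.translate(mapping)`.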

I have not yet had use for such a text extractor myself, so I don't know of any in the Python context. But who knows, perhaps pdfminer or another similar package can be told (via some option) to ignore the misleading ToUnicode map and then be used as outlined above.
