Ruby: extract Arabic text from PDF

Question

I usually use this code to extract text from PDFs:

require 'rubygems'
require 'pdf/reader'

filename = File.expand_path(File.dirname(__FILE__)) + "/myfile.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end

This time I want to parse an Arabic PDF, but, using this code, I get a bunch of weird characters. For example: ±πNuô ≠ö ¥πbËÊ ´Lö Ë«_°u«» ±GKIW √±U±Nr ËîUÅW √Ê ´bœ Ë≠w «∞LπLuŸ, ¥L

I have already read that coding: utf-8 is fine for Arabic, so, is there any solution?

Recommended answer

The text in this PDF is not properly encoded: the relation between what appears on the screen and what character code it represents is not stored in this PDF. That's why you get 'random' text.

Also notable: the text appears in the correct order, but that is because the font characters are drawn mirrored and the text itself is also drawn mirrored:

-- a typical hack-ish workaround to properly typeset Arabic using Quark XPress (there used to be an XTension (sp.?) that 'enabled' this).

Since this wrong encoding seems to be defined as such inside the fonts themselves ("Font uses built-in encoding", according to Acrobat Pro's "Inventory" function), you might be able to find a translation table between the characters you are reading and what they actually should be. Be aware that these tables may well differ for each font in this document, so you have to check which font each of your text strings uses.

I did some further investigations, and they agree with your own, and Acrobat Pro's, findings. Your sample text appears like this:

/F1 1 Tf        % set font and size "HGKECF+PHBagdad"
...
[ (´Mb ) -24.4 (¢'b¥b ) -24.4 («®{05}d«ØU¢Nr, ) -24.4 (Ë«ù´öÂ ) -24.4 (°LDU{03}&Nr.) ] TJ

Usually, font entries in a PDF contain a table that 'translates' into actual character codes. That is also true for this font (and all others):

<<
  /Type     /Font
  /Subtype  /Type1
  /BaseFont     /HGKECF+PHBagdad
  /Encoding     66 0 R
  /ToUnicode    69 0 R
>>

(Only the relevant entries are listed.) The /Encoding entry points to a simple list that maps each index to a character code, and /ToUnicode points to a more formal table that contains essentially the same information. Both lists translate to the same text.
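To make the idea of such a table concrete, here is a minimal sketch of reading the "bfchar" entries of a /ToUnicode CMap in plain Ruby. The CMap excerpt is hypothetical (it is not copied from this PDF) and `parse_bfchar` is an illustrative helper name; it only shows how a glyph code ends up as a Roman character when the table says so.

```ruby
# Hypothetical excerpt of a /ToUnicode CMap stream; NOT taken from the actual PDF.
SAMPLE_CMAP = <<~CMAP
  2 beginbfchar
  <4E> <004E>
  <B1> <00B1>
  endbfchar
CMAP

# Parse "bfchar" entries into { glyph code => Unicode string }.
# Ignores "bfrange" sections and surrogate pairs to keep the sketch short.
def parse_bfchar(cmap)
  map = {}
  cmap.scan(/beginbfchar(.*?)endbfchar/m).flatten.each do |body|
    body.scan(/<(\h+)>\s*<(\h+)>/) do |src, dst|
      # The destination is UTF-16BE written as hex, four digits per code unit.
      map[src.hex] = dst.scan(/\h{4}/).map(&:hex).pack("U*")
    end
  end
  map
end

parse_bfchar(SAMPLE_CMAP).each do |code, char|
  puts format("glyph code 0x%02X -> U+%04X (%s)", code, char.ord, char)
end
```

With a table like this, an extractor faithfully reports "±" for glyph code 0xB1, even though the glyph drawn on screen is an Arabic letter.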

As you can see in the top image, the font contains Arabic glyphs (mirrored), but the code linked to these glyphs is not the correct one for Arabic. It's like the old "Symbol" font hack: type 'a' to get an alpha, 'b' for a beta, 'g' for a gamma: text on your screen appears to be "ɑβɣ" but in truth it says "abg".

See also this Adobe Forum thread: Arabic - ToUnicode Map incorrect?

Quote:

Arabic XT fonts are not Arabic fonts from the operating system point of view (MacOS or Windows). They use the Mac Roman encoding; the Arabic glyphs are placed in place of the Roman glyphs.
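If that description applies to this PDF, each "weird" character in the extracted text is just the Mac Roman reading of a one-byte glyph code. The following sketch, which assumes exactly that, recovers the byte codes using Ruby's built-in macRoman transcoder. `glyph_codes` is a hypothetical helper name, and the sample string is the first word of the output shown in the question.

```ruby
# Assumption: the extracted text is the Mac Roman interpretation of the
# original one-byte glyph codes, as the forum quote suggests.
def glyph_codes(text)
  # Raises Encoding::UndefinedConversionError for characters outside Mac Roman.
  text.encode("macRoman").bytes
end

codes = glyph_codes("±πNuô")
puts codes.map { |b| format("%02X", b) }.join(" ")
```

Working with the numeric codes rather than the Roman characters can make it easier to compare strings against a glyph chart of the font.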

I tried to find a "correcting" encoding for your fonts but have thus far not been successful. If I could locate a translation table, it ought to be possible to exchange the existing /ToUnicode table with a corrected one, and you'd get the correct text when extracting. (Although it may be simpler to use the same table to change the text strings after extraction in your programming language of choice.)
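The post-extraction approach could look roughly like the sketch below. The character pairs in `TRANSLATION_TABLES` are made up for illustration, since the real table for this font has not been found; `correct_text` and `base_font_name` are hypothetical helper names. One table is kept per font because the mappings may differ between fonts.

```ruby
# Made-up pairs for illustration only: NOT the real PHBagdad mapping.
TRANSLATION_TABLES = {
  "PHBagdad" => { "±" => "\u0645", "N" => "\u0646" }
}.freeze

# Strip the six-letter subset tag, e.g. "HGKECF+PHBagdad" -> "PHBagdad".
def base_font_name(name)
  name.sub(/\A[A-Z]{6}\+/, "")
end

# Replace each extracted character using the table for the given font.
# Characters without an entry, and fonts without a table, pass through unchanged.
def correct_text(text, font_name, tables = TRANSLATION_TABLES)
  table = tables[base_font_name(font_name)]
  return text unless table

  text.each_char.map { |ch| table.fetch(ch, ch) }.join
end

puts correct_text("±N", "HGKECF+PHBagdad")
```

A plain Hash lookup per character keeps the tables easy to fill in by hand while comparing the PDF on screen with the extracted text.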
