Ruby: extract Arabic text from a PDF

Question

I usually use this code to extract text from PDFs:

require 'pdf/reader'

# Build the path relative to this script's directory.
filename = File.join(__dir__, "myfile.pdf")

# Print the extracted text of every page.
PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end

This time I want to parse an Arabic PDF, but with this code I get a bunch of weird characters. For example: ±πNuô ≠ö ¥πbËÊ ´Lö Ë«_°u«» ±GKIW √±U±Nr ËîUÅW √Ê ´bœ Ë≠w «∞LπLuŸ, ¥L

I have already read that coding: utf-8 is fine for Arabic, so is there any solution?

Recommended answer

The text in this PDF is not properly encoded: the relation between the glyphs that appear on screen and the character codes they represent is not stored in the PDF. That is why you get 'random' text.

Also notable: the text appears in the correct order, but only because the font's glyphs are drawn mirrored and the text itself is drawn mirrored as well. This is a typical hack-ish workaround for typesetting Arabic properly in Quark XPress (there used to be an XTension that 'enabled' this).

Since this wrong encoding appears to be defined as such inside the fonts ("Font uses built-in encoding", according to Acrobat Pro's "Inventory" function), you might be able to find a translation table between the characters you are reading and what they actually should be. Be aware that these tables may well differ for each font in the document, so you have to check which font each of your text strings uses.
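A minimal sketch of what such per-font translation could look like in Ruby. The font names and every table entry below are placeholders chosen for illustration; they are not the real mapping for any font in this PDF, which would still have to be worked out.

```ruby
# Hypothetical per-font correction tables. The font names and entries are
# placeholders for illustration only, NOT the real mapping for this PDF.
CORRECTIONS = {
  "FontA" => { "a" => "α", "b" => "β" },
  "FontB" => { "a" => "γ" }
}.freeze

# Replace each character using the table that belongs to the given font.
# Characters without an entry, and fonts without a table, pass through unchanged.
def correct_text(text, font_name)
  table = CORRECTIONS.fetch(font_name, {})
  text.each_char.map { |ch| table.fetch(ch, ch) }.join
end

puts correct_text("abc", "FontA") # prints "αβc"
puts correct_text("abc", "FontB") # prints "γbc"
```

Keying the tables by font name reflects the caveat above: the same extracted character may stand for a different glyph depending on which font drew it.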

I did some further investigation, and the results agree with your own findings and with Acrobat Pro's. Your sample text appears like this:

/F1 1 Tf        % set font and size "HGKECF+PHBagdad"
...
[ (´Mb ) -24.4 (¢'b¥b ) -24.4 («®{05}d«ØU¢Nr, ) -24.4 (Ë«ù´öÂ ) -24.4 (°LDU{03}&Nr.) ] TJ

Usually, a font entry in a PDF contains a table that 'translates' the stored codes into actual character codes. That is also true for this font (and all the others):

<<
  /Type     /Font
  /Subtype  /Type1
  /BaseFont     /HGKECF+PHBagdad
  /Encoding     66 0 R
  /ToUnicode    69 0 R
>>

(Only the relevant entries are listed.) The /Encoding entry points to a simple array that maps indexes to character codes, and /ToUnicode points to a more formal table that contains essentially the same information. Both tables translate to the same text.
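To make the /ToUnicode table less abstract, here is a rough sketch of reading the bfchar section of a ToUnicode CMap into a Ruby Hash. The CMap fragment is made up for illustration and is not taken from this PDF, and the parser is deliberately simplified: it ignores bfrange sections and everything else in the CMap.

```ruby
# Sample ToUnicode CMap fragment, made up for illustration (not from this PDF).
SAMPLE_CMAP = <<~CMAP
  2 beginbfchar
  <4E> <004E>
  <75> <0075>
  endbfchar
CMAP

# Parse the "<src> <dst>" pairs inside beginbfchar/endbfchar sections into a
# Hash of character code (Integer) => Unicode string.
# Simplified sketch: bfrange sections are not handled.
def parse_bfchar(cmap)
  mapping = {}
  cmap.scan(/beginbfchar(.*?)endbfchar/m) do |(section)|
    section.scan(/<(\h+)>\s*<(\h+)>/) do |src, dst|
      # The destination is UTF-16BE written as hex, one or more 16-bit units.
      units = dst.scan(/\h{4}/).map(&:hex)
      mapping[src.hex] = units.pack("n*").force_encoding("UTF-16BE").encode("UTF-8")
    end
  end
  mapping
end

p parse_bfchar(SAMPLE_CMAP)
```

In this made-up fragment the codes map to plain Latin letters, which is the kind of table that would yield Latin-looking output even though the glyphs on screen are Arabic.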

The font contains Arabic glyphs (mirrored), but the code linked to each glyph is not the correct one for Arabic. It is like the old "Symbol" font hack: type 'a' to get an alpha, 'b' for a beta, 'g' for a gamma. The text on your screen appears to be "αβγ", but in truth it says "abg".

See also this Adobe Forum thread: Arabic - ToUnicode Map incorrect?

Quote:

Arabic XT fonts are not Arabic fonts from the operating system's point of view (MacOS or Windows). They use the Mac Roman encoding; the Arabic glyphs are put in place of the Roman glyphs.

I tried to find a "correcting" encoding for your fonts but have so far not been successful. If I could locate a translation table, it ought to be possible to replace the existing /ToUnicode table with a corrected one, and you would get the correct text when extracting. (Although it may be simpler to use the same table to fix the text strings after extraction, in your programming language of choice.)
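If a translation table were available, the first option (a corrected /ToUnicode table) could start from something like the sketch below, which turns a Hash of character code => corrected Unicode string into a bfchar section. The two entries are placeholders borrowed from the Symbol font analogy, not a real Arabic mapping, and writing the result back into the PDF is outside the scope of this sketch.

```ruby
# Build the bfchar section of a ToUnicode CMap from a Hash of
# character code (Integer) => corrected Unicode string.
# Sketch only: assumes single-byte codes, omits the rest of the CMap
# (header, codespace range, footer), and does not split long tables
# into blocks of at most 100 entries.
def build_bfchar(mapping)
  lines = mapping.sort.map do |code, text|
    # Unicode values are written as UTF-16BE hex digits.
    utf16 = text.encode("UTF-16BE").unpack("n*").map { |u| format("%04X", u) }.join
    format("<%02X> <%s>", code, utf16)
  end
  ["#{lines.size} beginbfchar", *lines, "endbfchar"].join("\n")
end

# Placeholder entries: code 0x61 -> U+03B1 (alpha), 0x62 -> U+03B2 (beta).
puts build_bfchar({ 0x61 => "α", 0x62 => "β" })
# prints:
# 2 beginbfchar
# <61> <03B1>
# <62> <03B2>
# endbfchar
```

The same Hash could be applied directly to extracted strings instead, which is the simpler second option mentioned above.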
