Parsing PDF files

Question

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used pdfbox and itext but couldn't find anything in them that could help parse this file. Here's the PDF file I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.

Thanks, K

Recommended answer

When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers and libraries, and was therefore corrupt in some way.

But that's not the case at all. It opens just fine in Acrobat Reader X, and I can see the text on the page.

And when I copy/paste that text from the first page, I get:

Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á

from Reader.

Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252), with a "differences" array. This differences array is wrong:

47 /BB 61 /BP /BQ 81 /C6 ...

Each number is the code point being replaced, and the name after it is the name of the character that replaces the original value at that code point. When several names follow one number, they are assigned to consecutive code points, so here /BP lands on 61 and /BQ on 62.
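To make the layout of the array concrete, here is a minimal sketch that expands a differences array into a code-to-name table. It assumes the array has already been tokenized into Python integers and "/Name" strings; the function name `parse_differences` is made up for this illustration and is not part of pdfbox or itext.

```python
def parse_differences(tokens):
    """Expand a PDF /Differences array into a {code: glyph_name} dict.

    `tokens` is the array already split into Python values: ints for
    codes, strings starting with "/" for glyph names (an assumption
    about how the caller tokenized the PDF).
    """
    mapping = {}
    code = None
    for tok in tokens:
        if isinstance(tok, int):
            code = tok                      # a number resets the current code
        else:
            if code is None:
                raise ValueError("glyph name before any code")
            mapping[code] = tok.lstrip("/")
            code += 1                       # later names take consecutive codes
    return mapping


# The fragment quoted above, without the elided remainder.
print(parse_differences([47, "/BB", 61, "/BP", "/BQ", 81, "/C6"]))
```

For the fragment quoted above, this yields entries for codes 47, 61, 62 and 81.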

There are no such character names as BB, BP, BQ, C9, and so on. So when you copy and paste that text, you get the garbage above.
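As a rough illustration of why the names matter, the sketch below mimics how a text extractor might turn a glyph name into text: first a lookup in a table of standard names, then the `uniXXXX` naming convention, and otherwise failure. The table here is a tiny hand-picked subset for demonstration only (a real extractor would use the full Adobe Glyph List), the regular expression handles only the simple four-digit form, and `glyph_to_unicode` is a hypothetical helper.

```python
import re

# Tiny, illustrative subset of standard glyph names. A real extractor
# would load the full Adobe Glyph List instead.
KNOWN_GLYPHS = {"space": " ", "A": "A", "a": "a", "slash": "/", "equal": "="}


def glyph_to_unicode(name):
    """Return the text for a glyph name, or None if it cannot be resolved."""
    if name in KNOWN_GLYPHS:
        return KNOWN_GLYPHS[name]
    # Simplified handling of the "uni" + four uppercase hex digits convention.
    match = re.fullmatch(r"uni([0-9A-F]{4})", name)
    if match:
        return chr(int(match.group(1), 16))
    return None                             # BB, BP, BQ, ... end up here


for name in ["slash", "uni0C15", "BB", "BP", "C6"]:
    print(name, "->", glyph_to_unicode(name))
```

Names like BB and BP fall through both checks, so the extractor has nothing meaningful to emit for those codes.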

I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).

Hmm... a long-shot idea:

If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.

Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.

You could either fix the existing PDF(s), by changing the names in the encoding dictionary and the Type 3 CharProc entries so that these text extractors work correctly, or just grab the bytes out of the stream and translate them yourself.
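The "fix the PDF" option could look roughly like the sketch below. It works on plain Python dictionaries standing in for the font's differences table and CharProcs dictionary, so reading and writing the actual PDF objects is left out. The function `rename_glyphs`, the `uniXXXX` target names, and the sample mapping of BB to a Telugu letter are all assumptions made for illustration; whether a given extractor honors the renamed glyphs would need to be checked, and a glyph that represents more than one character would need different handling.

```python
def rename_glyphs(differences, char_procs, translation):
    """Rename glyphs to uniXXXX names wherever a translation is known.

    differences: {code: glyph_name}
    char_procs:  {glyph_name: stream_bytes}
    translation: {glyph_name: single Unicode character}
    Returns new (differences, char_procs); unknown names are left alone.
    """
    def new_name(old):
        ch = translation.get(old)
        return f"uni{ord(ch):04X}" if ch else old

    new_diffs = {code: new_name(name) for code, name in differences.items()}
    new_procs = {new_name(name): stream for name, stream in char_procs.items()}
    return new_diffs, new_procs


# Hypothetical data: pretend we learned that glyph "BB" is U+0C15.
diffs, procs = rename_glyphs(
    {47: "BB", 61: "BP"},
    {"BB": b"...", "BP": b"..."},
    {"BB": "\u0c15"},
)
print(diffs)
print(sorted(procs))
```

Both dictionaries have to be renamed together, because the encoding refers to CharProcs by glyph name.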

The workflow would go something like this:

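  1. For each character in a font used in the PDF:
     1. Render it by itself to a PDF using the same LaTeX/Ghostscript versions.
     2. Open that PDF and find the CharProc for that particular known character.
     3. Store that stream along with the known character used to build it.

  2. For each text byte in the PDF to be interpreted:
     1. Get the glyph name for the given byte from the existing encoding array.
     2. Get the "char proc" stream for that glyph name and compare it to your known char procs.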

NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
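The two loops above can be sketched as follows, again on plain Python data rather than a real PDF library. The known streams are indexed by hash, which is one simple form of the caching mentioned in the note. Everything here is illustrative: the helper names, the sample streams, and the assumption that a matching glyph produces an identical stream once whitespace is normalized (which is why using the same tool versions matters).

```python
import hashlib


def fingerprint(stream):
    """Hash a charproc stream after collapsing whitespace differences."""
    return hashlib.sha256(b" ".join(stream.split())).hexdigest()


def build_known_table(known):
    """known: iterable of (character, charproc_stream) from reference PDFs."""
    return {fingerprint(stream): ch for ch, stream in known}


def translate(text_bytes, differences, char_procs, table, unknown="\ufffd"):
    """Map each byte -> glyph name -> charproc stream -> known character."""
    out = []
    for b in text_bytes:
        name = differences.get(b)
        stream = char_procs.get(name)
        ch = table.get(fingerprint(stream)) if stream is not None else None
        out.append(ch if ch is not None else unknown)
    return "".join(out)


# Hypothetical reference renderings and target font data.
table = build_known_table([("\u0c15", b"0 0 m\n10 10 l S"),
                           ("\u0c16", b"5 5 m 0 9 l S")])
differences = {47: "BB", 61: "BP"}
char_procs = {"BB": b"0 0 m 10 10 l S", "BP": b"5 5 m 0 9 l S"}
print(translate(bytes([47, 61, 99]), differences, char_procs, table))
```

Bytes with no glyph name or no matching stream come out as U+FFFD, which makes the gaps in the translation table easy to spot.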

All of that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. It might not, too...
