How to get text extraction from PDF to work?


Question

I need to extract the text from PDFs in Romanian. The characters ȚțȘșĂăÎîÂâ are not extracted correctly with PDFBox or Snowtide.

Here is a sample file that does not work: ftp://ftp.logos.md/Biblioteca/_Colectie_RO/2nefon.pdf

Any suggestions?

Recommended answer

I'm afraid the PDF the OP pointed at (2nefon.pdf) does not provide the information that the specification requires for text extraction.

Trying to copy and paste from Adobe Reader also exports the special characters incorrectly. Since Adobe Reader has quite good text extraction capabilities, this is already a bad sign.

Inspecting the file shows the problem. For example, let's look at the title.

The corresponding section of the content stream is:

/F1 24 Tf
-148.44 -26.16 TD
(VIA}A  {I  ~NV|}|TURILE) Tj
296.88 0 TD
( ) Tj
-308.16 -29.28 TD
(SFANTULUI  IERARH  NIFON) Tj

Let's check the font used, F1:

469 0 obj
<< 
/Type /Font 
/Subtype /TrueType 
/Name /F1 
/BaseFont /TimesR 
/FirstChar 32 
/LastChar 255 
/Widths [ 250 333 444 722 500 833 778 [...] 500 500 500 500 500 500 500 ] 
/Encoding /WinAnsiEncoding 
/FontDescriptor 468 0 R 
>> 
endobj 

Thus, the font claims to use WinAnsiEncoding without changes (there is no Differences array).
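The check just performed on the font dictionary can be sketched in a few lines of Python. This is only an illustration: the dictionary `F1` is a hand-written stand-in for object 469 above (name objects written as plain strings, some entries omitted), and `unicode_mapping_sources` is a hypothetical helper, not part of PDFBox or any other library.

```python
def unicode_mapping_sources(font):
    """List the Unicode-mapping information a simple font dictionary offers."""
    sources = []
    if "ToUnicode" in font:
        sources.append("ToUnicode CMap")
    encoding = font.get("Encoding")
    if isinstance(encoding, str):
        # A bare name such as WinAnsiEncoding: a predefined encoding, unchanged.
        sources.append(f"predefined encoding {encoding}")
    elif isinstance(encoding, dict):
        # An encoding dictionary may carry a Differences array.
        base = encoding.get("BaseEncoding", "built-in encoding")
        sources.append(f"encoding dictionary based on {base}")
        if "Differences" in encoding:
            sources.append("Differences array")
    return sources


# Hand-written stand-in for font object 469 (hypothetical representation).
F1 = {
    "Type": "Font",
    "Subtype": "TrueType",
    "BaseFont": "TimesR",
    "FirstChar": 32,
    "LastChar": 255,
    "Encoding": "WinAnsiEncoding",
}

print(unicode_mapping_sources(F1))
```

For `F1` the only source reported is the predefined WinAnsiEncoding: there is no ToUnicode CMap and no Differences array that an extractor could use instead.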

Finally, a look at the font descriptor:

468 0 obj
<< 
/Type /FontDescriptor 
/FontName /TimesR 
/Flags 34 
/FontBBox [ -167 -307 1009 913 ] 
/StemV 90 
/ItalicAngle 0 
/CapHeight 913 
/Ascent 913 
/Descent -307 
/FontFile2 474 0 R 
>> 
endobj

There is no hint here either that the aforementioned WinAnsiEncoding might not be the whole truth.

According to the PDF specification, ISO 32000-1:

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font [... cut short because the font F1 is not a composite font ...]

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(Section 9.10.2, Mapping Character Codes to Unicode Values)

So text extraction and copy and paste are fully following the specification when they report that, according to the document, those two lines say:

VIA}A {I ~NV|}|TURILE
SFANTULUI IERARH NIFON
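To make the quoted procedure concrete, here is a minimal Python sketch of the lookup for a simple font. It is an illustration under stated assumptions, not how PDFBox or Adobe Reader is implemented: `code_to_unicode` is a hypothetical function, the two tables are tiny excerpts of the WinAnsiEncoding column of Table D.1 and of the Adobe Glyph List, and falling through to the encoding when a ToUnicode CMap lacks an entry is a simplification of this sketch.

```python
# Excerpt of Table D.1, WinAnsiEncoding column: character code -> glyph name.
WIN_ANSI_NAMES = {
    0x41: "A",
    0x7B: "braceleft",
    0x7C: "bar",
    0x7D: "braceright",
    0x7E: "asciitilde",
}

# Excerpt of the Adobe Glyph List: glyph name -> Unicode code point.
AGL = {
    "A": 0x0041,
    "braceleft": 0x007B,
    "bar": 0x007C,
    "braceright": 0x007D,
    "asciitilde": 0x007E,
}


def code_to_unicode(code, to_unicode=None, differences=None):
    """Map one character code of a simple font to Unicode, or return None."""
    # 1. A ToUnicode CMap, if present, has priority.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # 2a. Code -> glyph name; Differences entries override the base encoding.
    name = (differences or {}).get(code, WIN_ANSI_NAMES.get(code))
    # 2b. Glyph name -> Unicode via the Adobe Glyph List.
    if name in AGL:
        return chr(AGL[name])
    # 3. Nothing worked: the meaning of the code cannot be determined.
    return None


# Font F1 has neither a ToUnicode CMap nor a Differences array,
# so code 0x7C can only resolve to the glyph name "bar".
print(code_to_unicode(0x7C))
# With a (hypothetical) ToUnicode entry the same code would yield the intended letter.
print(code_to_unicode(0x7C, to_unicode={0x7C: "Ă"}))
```

The first call shows why an extractor that follows the rules has to output `|` for this file; the second shows the kind of information the PDF would have needed to carry for the Romanian letter to come out.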

You might want to check, though, whether, for example, Ă (capital A with breve) is always exported as |. This is not unlikely: mapping special characters to the character codes of symbols was quite common for a time in the last century. If that is indeed the case, a global search and replace after text extraction gives you the desired text.
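Such a post-processing step could look like the Python sketch below. Only the pair `|` → `Ă` is named above. The other three entries are guesses read off the extracted title on the assumption that it spells "VIAȚA ȘI ÎNVĂȚĂTURILE". They would need to be verified against the whole document, and the lowercase letters would need entries of their own. `fix_romanian` is a hypothetical helper name.

```python
# Hypothetical substitution table: "|" -> "Ă" comes from the text above,
# the remaining entries are assumptions inferred from the extracted title.
SUBSTITUTIONS = str.maketrans({
    "|": "Ă",
    "}": "Ț",
    "{": "Ș",
    "~": "Î",
})


def fix_romanian(text):
    """Replace symbol characters with the Romanian letters they stand for."""
    return text.translate(SUBSTITUTIONS)


print(fix_romanian("VIA}A {I ~NV|}|TURILE"))
```

Note that a blanket replacement like this also rewrites any genuine braces, bars, or tildes in the text, so it is only safe if the document does not use those symbols for their normal meaning.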
