PDFBox returns badly encoded characters

Problem description

I have a PDF, http://www.persianacademy.ir/UserFiles/File/fe1394.pdf, from which I want to extract words (it contains Persian words). I use the PDFBox library to get the words. Here is my code:

package ir.blog.stack;

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFManager {

    public static void main(String[] args) {
        PDFManager pdfManager = new PDFManager();
        pdfManager.setFilePath("/home/saeed/Documents/words.pdf");
        try {
            System.out.println(pdfManager.ToText());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc ;
    private COSDocument cosDoc ;

    private String Text ;
    private String filePath;
    private File file;

    public PDFManager() {

    }
    public String ToText() throws IOException
    {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;

        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(5);

        // reading text from page 1 to 5
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

}

This is part of the output:

° ǽA ° SwA ²j±ÇÇM/SwA ²joÇ Ak¼ÇQ ³Ç«AjA p°oÇ«A ³ÇM BÇU éÇ
BÇM ¤ Ø°A ·ª¦ °j ³ An <»wB®{Sv½p> ° <»wB®z¯BMp> ,<³¯BhQBa> ,<³¯BiRnB\U>
»¯BwC³ÇM ©½o¼¢Moǯnj kǯA²k{ ³TiBw <»wB®{> BM ¨°j ·ª¦ °j ° <³¯Bi> ·ª¦
k{BÇM ³TÇ{Aj j±]° o¯ ßB
UA ¬C nj ³ ºA²kîB RBª¦ ½A ߺÀ«A ³ ©¼MB½»«nj
/jnAk¯
° ²k{tBLTA »¼® Øßi pA j±i »Moî Øßi ° ²k{ ³To£ »Moî Øßi pA B« Øßi

Do I need to take extra steps to get the correct words?

Recommended answer

The PDF in question simply does not contain the information required for text extraction. You will have to try with OCR.

For text extraction from a PDF to succeed, the PDF must contain some information on which Unicode character is represented by each used glyph.

The PDF specification describes the following text extraction process:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
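
The priority order in the quoted section can be pictured as a small decision function. What follows is only an illustrative sketch, not PDFBox code: the class `UnicodeMappingCheck`, the `FontInfo` holder, and all of its field names are hypothetical. As a simplification, the composite-font branch looks only at the character collection and does not check the predefined CMaps of Table 118.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the lookup order in section 9.10.2; not part of PDFBox.
final class UnicodeMappingCheck {

    enum Method { TO_UNICODE_CMAP, GLYPH_NAME_LOOKUP, UCS2_CMAP, NONE }

    /** Hypothetical summary of the font dictionary entries the procedure inspects. */
    static final class FontInfo {
        final boolean hasToUnicode;
        final boolean composite;
        final String encoding;            // e.g. "WinAnsiEncoding" or "Identity-H"
        final boolean standardNamesOnly;  // simple fonts: Differences use only standard names
        final String registry;            // composite fonts: CIDSystemInfo registry
        final String ordering;            // composite fonts: CIDSystemInfo ordering

        FontInfo(boolean hasToUnicode, boolean composite, String encoding,
                 boolean standardNamesOnly, String registry, String ordering) {
            this.hasToUnicode = hasToUnicode;
            this.composite = composite;
            this.encoding = encoding;
            this.standardNamesOnly = standardNamesOnly;
            this.registry = registry;
            this.ordering = ordering;
        }
    }

    private static final Set<String> SIMPLE_ENCODINGS = new HashSet<>(
            Arrays.asList("MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"));
    private static final Set<String> KNOWN_ORDERINGS = new HashSet<>(
            Arrays.asList("GB1", "CNS1", "Japan1", "Korea1"));

    /** Returns the first method, in priority order, that could yield a Unicode value. */
    static Method classify(FontInfo f) {
        if (f.hasToUnicode) {
            return Method.TO_UNICODE_CMAP;
        }
        if (!f.composite) {
            boolean usable = SIMPLE_ENCODINGS.contains(f.encoding) || f.standardNamesOnly;
            return usable ? Method.GLYPH_NAME_LOOKUP : Method.NONE;
        }
        // Simplification: only the character collection is checked here.
        boolean knownCollection = "Adobe".equals(f.registry) && KNOWN_ORDERINGS.contains(f.ordering);
        return knownCollection ? Method.UCS2_CMAP : Method.NONE;
    }

    public static void main(String[] args) {
        // A composite font with Identity-H, Adobe-Identity and no ToUnicode map.
        FontInfo identityFont = new FontInfo(false, true, "Identity-H", false, "Adobe", "Identity");
        System.out.println(classify(identityFont));
    }
}
```

A font that falls through to `NONE` is one for which the specification offers no mapping at all, which is the situation the final quoted sentence describes.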

In case of the sample PDF, the fonts in question

  • do not have ToUnicode maps;
  • are composite;
  • use Identity-H as Encoding;
  • have a CIDSystemInfo value of Adobe-Identity-0.

Thus, the process quoted above fails to produce a Unicode value.
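
For contrast, when a font does carry a ToUnicode CMap, the mapping from character code to Unicode is spelled out explicitly. The sketch below reads the `beginbfchar` ... `endbfchar` entries of such a CMap using only the Java standard library. The class name `ToUnicodeSketch` and the CMap fragment in `main` are made up for illustration and are not taken from any real PDF. `bfrange` sections are ignored for brevity, and PDFBox has its own CMap parser that a real application would rely on instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a minimal reader for bfchar entries of a ToUnicode CMap.
final class ToUnicodeSketch {

    // The text between "beginbfchar" and "endbfchar".
    private static final Pattern BFCHAR_SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    // One "<sourceCode> <utf16beHex>" pair inside such a section.
    private static final Pattern BFCHAR_PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Maps each character code found in bfchar sections to its Unicode string. */
    static Map<Integer, String> parseBfChars(String cmapText) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher section = BFCHAR_SECTION.matcher(cmapText);
        while (section.find()) {
            Matcher pair = BFCHAR_PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                result.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return result;
    }

    // The destination is a hex string of UTF-16BE code units, four hex digits each.
    private static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical fragment: code 0x0003 is a space, code 0x0024 is ARABIC LETTER BEH.
        String cmap = "2 beginbfchar\n<0003> <0020>\n<0024> <0628>\nendbfchar\n";
        Map<Integer, String> map = parseBfChars(cmap);
        System.out.println("code 0x24 mapped: " + map.containsKey(0x24));
        System.out.println("code 0x59 mapped: " + map.containsKey(0x59));
    }
}
```

Without such a table, a text extractor has nothing to turn a character code into a Unicode character, so it can only guess.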

The PDF specification alternatively allows the use of ActualText entries in structure element dictionaries or marked-content sequences to override the text some content shall represent.

In case of the sample PDF, no ActualText entries are used.

One can look deeper than the PDF specification describes; in particular, one can dive into the embedded font programs to find font-specific information on the Unicode characters that the font's glyphs represent.

In case of the sample PDF, the embedded font programs

  • do not contain Unicode values for the glyphs;
  • use uninformative glyph names like "glyph89".
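
To see why a name like "glyph89" is of no help, consider how a glyph name is normally turned into a Unicode value. The sketch below is a hypothetical, heavily reduced lookup: `GlyphNameLookup` and its three-entry table are illustrative stand-ins for the Adobe Glyph List, which has thousands of entries. The `uniXXXX` branch follows Adobe's glyph-naming convention and handles only the single four-digit form.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a tiny stand-in for an Adobe Glyph List lookup.
final class GlyphNameLookup {

    // Three sample entries; the real list is far larger.
    private static final Map<String, Integer> AGL_EXCERPT = new HashMap<>();
    static {
        AGL_EXCERPT.put("space", 0x0020);
        AGL_EXCERPT.put("A", 0x0041);
        AGL_EXCERPT.put("eacute", 0x00E9);
    }

    /** Returns the Unicode string for a glyph name, or empty if the name is not informative. */
    static Optional<String> toUnicode(String glyphName) {
        Integer codePoint = AGL_EXCERPT.get(glyphName);
        if (codePoint == null && glyphName.matches("uni[0-9A-F]{4}")) {
            // Names of the form uniXXXX encode the code point directly.
            codePoint = Integer.parseInt(glyphName.substring(3), 16);
        }
        return codePoint == null
                ? Optional.empty()
                : Optional.of(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        System.out.println("A resolves: " + toUnicode("A").isPresent());
        System.out.println("uni0628 resolves: " + toUnicode("uni0628").isPresent());
        System.out.println("glyph89 resolves: " + toUnicode("glyph89").isPresent());
    }
}
```

A name such as "glyph89" appears in no glyph list and follows no naming convention that encodes a code point, so this route yields nothing either.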

Thus, in case of the sample PDF, you most likely will have to resort to OCR.
