Text extraction is empty for text that uses a Type 3 font with PDFBox and iText (difficult topic!)


Problem description

I have a PDF file in Arabic whose text uses a Type 3 font. When I extract the text with PDFBox, some characters come back empty and their font is null. What is the problem?

Code:

  protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter();      // is empty
    String font = text.getFont().getBaseFont();  // is null
  }

Stream generated with iText: (dJ v{dW cG )Tj

I am talking about these question marks. Why do I get the characters in this format?

These question marks appear in my stream as "SOH-STX-ETX-EOT", not as a single character. The characters inside the PDF are shown as 'd' and 'J'!

Recommended answer

A Type 3 font is a user-defined font. For instance, a user can define that the character 'P' corresponds to the symbol for "The Artist Formerly Known As Prince" (TAFKAP), which is a glyph but not a letter from any known alphabet.

A glyph in a Type 3 font is a series of lines and shapes, and, unless the PDF also supplies a mapping from character codes to Unicode, there is no way for a program such as iText or PDFBox to determine which character was meant. It is only normal that you get a question mark. For instance: which character would you use for this symbol?
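To make the point concrete, here is a minimal sketch in plain Java. It is not PDFBox or iText code; the class name `Type3DecodeSketch`, the `decode` method and the lookup table are all invented for illustration. It assumes the string operand holds one-byte character codes, and it uses a `Map` as a stand-in for whatever code-to-Unicode information a font may or may not carry. Without such a table, codes like 0x01 to 0x04 (the SOH, STX, ETX, EOT from the question) have nothing to be translated into.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: this is not PDFBox or iText code.
class Type3DecodeSketch {

    // Maps each character code to text using an optional code-to-Unicode
    // table (a stand-in for mapping data a font may or may not provide).
    // Codes without an entry become U+FFFD, the Unicode replacement character.
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            String mapped = (toUnicode == null) ? null : toUnicode.get(code);
            out.append(mapped != null ? mapped : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Codes 0x01..0x04 are what show up as SOH, STX, ETX, EOT.
        byte[] codes = {0x01, 0x02, 0x03, 0x04};

        // Without a table, every code is unknown.
        System.out.println(decode(codes, null));

        // Hypothetical table, invented purely for this example.
        Map<Integer, String> table = new HashMap<>();
        table.put(0x01, "\u0633"); // ARABIC LETTER SEEN
        table.put(0x02, "\u0644"); // ARABIC LETTER LAM
        table.put(0x03, "\u0627"); // ARABIC LETTER ALEF
        table.put(0x04, "\u0645"); // ARABIC LETTER MEEM
        System.out.println(decode(codes, table));
    }
}
```

The glyph shapes play no part in the lookup: the same four codes yield readable text or nothing useful depending only on whether a table is available.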

One of the following reasons applies to a PDF that contains Type 3 fonts:


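  1. The font was used to introduce symbols that do not exist in any font.

  2. The font was used to obfuscate the content of the PDF so that it cannot be extracted.

  3. The PDF was not created in an elegant way.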

If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.
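Before falling back to OCR, it can help to check whether the extracted text is actually unusable. The following is a rough heuristic sketch, not part of PDFBox or iText: the class name `ExtractionCheckSketch`, the method `looksUnmapped` and the 50% threshold are assumptions made for this example. It flags text that is empty or dominated by control characters (such as SOH, STX, ETX, EOT) or by U+FFFD.

```java
// Illustration only: a heuristic, not a PDFBox or iText API.
class ExtractionCheckSketch {

    // Returns true when the extracted text is empty, or when more than half
    // of its non-whitespace code points are control characters or U+FFFD.
    // The 50% threshold is an arbitrary choice for this sketch.
    static boolean looksUnmapped(String extracted) {
        if (extracted == null || extracted.isEmpty()) {
            return true;
        }
        int suspicious = 0;
        int total = 0;
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks say nothing either way
            }
            total++;
            if (Character.isISOControl(cp) || cp == 0xFFFD) {
                suspicious++;
            }
        }
        return total == 0 || suspicious * 2 > total;
    }

    public static void main(String[] args) {
        System.out.println(looksUnmapped("\u0001\u0002\u0003\u0004"));
        System.out.println(looksUnmapped("Hello"));
    }
}
```

A page for which this returns true would be a candidate for rendering to an image and running through OCR; a page that returns false can probably be used as extracted.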
