How to read control characters in a PDF using Java


Question

I'm using PDFBox to read PDF files, but some characters are not extracted correctly and come out looking like control characters. Can someone help me read the actual values behind these control characters? I've attached an image; kindly have a look at it.

Sample PDF:

Screenshot:

Sample Code

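import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package name assumes PDFBox 1.8.x

class PDFManager {

   private PDFParser parser;
   private PDFTextStripper pdfStripper;
   private PDDocument pdDoc;
   private COSDocument cosDoc;

   private String Text;
   private String filePath;
   private File file;

   public PDFManager() {

   }

   // Extracts the text of pages 3 to 4 of the PDF at filePath.
   public String ToText() throws IOException {
       this.pdfStripper = null;
       this.pdDoc = null;
       this.cosDoc = null;
       file = new File(filePath);
       parser = new PDFParser(new FileInputStream(file));

       parser.parse();
       cosDoc = parser.getDocument();
       pdfStripper = new PDFTextStripper();
       pdDoc = new PDDocument(cosDoc);
       try {
           pdDoc.getNumberOfPages();
           pdfStripper.setStartPage(3);
           pdfStripper.setEndPage(4);
           Text = pdfStripper.getText(pdDoc);
       } finally {
           pdDoc.close(); // also closes the underlying COSDocument
       }

       return Text;
   }

   public void setFilePath(String filePath) {
       this.filePath = filePath;
   }
}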

Solution

The reason you get both correct Tamil letters and incorrect control characters is that the respective fonts

  • don't have a ToUnicode map and
  • have an Encoding entry using non-standard names for some glyphs.

In such a situation PDFBox cannot properly extract associated characters without help.

To help PDFBox, you have to check whether, in all your documents (or at least in a subset large enough to be of interest), each non-standard name always stands for the same drawn glyph. If so, you can tell PDFBox to add mappings from each of these names to the Unicode value of the letter it draws, extending its set of known glyph mappings.

In more detail:

The issue

I'll illustrate the issue here with an example.

On page 3, which the OP mentioned, the first text is drawn using instructions equivalent to these:

/R9 8.04 Tf
0.999418 0 0 1 519.6 791.721 Tm
[<01>6.75242<0C>-0.371893<0D>4.89295<07>3.77727<14>-6.13989<35>-4.51376<02>-5.00233<0F>187.988]TJ 

(I merely changed the representation of the strings to hexadecimal as the individual codes mostly are in the control character range and, therefore, would not display properly here.)
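
To make the structure of that TJ operand concrete, here is a minimal sketch that pulls the character codes out of it and skips the kerning numbers in between. The class `TjCodes` and its method are hypothetical helpers written for this illustration, not part of PDFBox, and the sketch assumes a simple font in which every byte is one character code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not PDFBox API): extracts character codes from a TJ operand. */
final class TjCodes {
    // A PDF hex string: '<', hex digits (whitespace allowed), '>'.
    private static final Pattern HEX_STRING = Pattern.compile("<([0-9A-Fa-f\\s]*)>");

    static List<Integer> extract(String tjOperand) {
        List<Integer> codes = new ArrayList<>();
        Matcher m = HEX_STRING.matcher(tjOperand);
        while (m.find()) {
            String hex = m.group(1).replaceAll("\\s", "");
            if (hex.length() % 2 != 0) {
                hex += "0"; // an odd number of hex digits is padded with a trailing 0
            }
            // Assumption: simple font, so each byte is one character code.
            for (int i = 0; i < hex.length(); i += 2) {
                codes.add(Integer.parseInt(hex.substring(i, i + 2), 16));
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        String tj = "[<01>6.75242<0C>-0.371893<0D>4.89295<07>3.77727"
                + "<14>-6.13989<35>-4.51376<02>-5.00233<0F>187.988]TJ";
        System.out.println(extract(tj)); // prints [1, 12, 13, 7, 20, 53, 2, 15]
    }
}
```

Most of these codes are below 0x20, which is why the raw strings look like control characters.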

The font R9 of this page does not have a ToUnicode map, nor are there any ActualText entries. Thus, PDFBox can only use the Encoding entry of the font:

<<
  /BaseEncoding/WinAnsiEncoding
  /Differences[1
    /u0BC6/u0B9A/g125/u0BC8/u0BA9/g121/u0B9F/u0BAE
    /u0BB1/g123/space/u0BA4/u0BBE/g148/u0BBF/u0B8E
    /g122/u0BAA/u0BAF/g129/g130/g178/g127/u0B92
    /g162/g116/u0B95/u0BC0/g158/u0BA8/u0BB2/colon
    /u0B85/g117/g173/g132/u0BB3/g182/g142/one
    /period/g175/u0BB5/u0BB0/g126/u0B86/u0BC7/g186
    /g156/g131/g143/two/g118/g133/g190/hyphen
    /zero/five/g171/g120/g146/g169/g152/parenleft
    /seven/parenright/three/g180/u0BA3/eight/g136/u0BB4
    /u0B9C/four/six/g124/nine/g135/slash/g172
    /comma/u0B87/numbersign/g128/g147/g160/u0B9E/u0B89
    /u0BB7/g119/g157/g167/g191/g188/g170/g145
    /g181/u0BB8/u0B90/uni25CC/u0BCD/u0BB9/u0BC1/u0B88
    /g163/u0BD7/g184/u0B8F/g174/g153/g138/g185
    /g134/g149/g176]
  /Type/Encoding
>> 

As you can see, it first declares the base encoding WinAnsiEncoding. This can be ignored, because more or less all the mappings in the range of codes the font uses are then replaced by the Differences array.

In the Differences array you can find

  • a number of standard names like comma and two;
  • many names representing unicode code points using the uXXXX scheme, like u0BC6 and u0B9A;
  • one name representing a unicode code point using the uniXXXX scheme: uni25CC; and
  • many completely non-standard names using a gXXX scheme like g121 and g176.

PDFBox supports the standard names (obviously) and, in addition, both Unicode code point naming schemes used here, uXXXX and uniXXXX (these are common and straightforward to interpret).

It does not support other names out of the box.
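
The difference between the naming schemes can be sketched in a few lines. This is an illustration of the idea only, not PDFBox's actual implementation: `GlyphNameUnicode` is a hypothetical class, standard names such as comma or two are deliberately left out because they need a lookup table (the Adobe Glyph List), and only the single-code-point form of uniXXXX is handled.

```java
import java.util.Optional;

/** Hypothetical helper (not PDFBox code): derives Unicode from uXXXX / uniXXXX glyph names. */
final class GlyphNameUnicode {
    static Optional<String> fromName(String name) {
        String hex;
        if (name.matches("uni[0-9A-F]{4}")) {
            hex = name.substring(3);          // e.g. uni25CC -> 25CC
        } else if (name.matches("u[0-9A-F]{4,6}")) {
            hex = name.substring(1);          // e.g. u0BC6 -> 0BC6
        } else {
            return Optional.empty();          // e.g. g129: the name carries no Unicode information
        }
        int codePoint = Integer.parseInt(hex, 16);
        if (!Character.isValidCodePoint(codePoint)) {
            return Optional.empty();
        }
        return Optional.of(new String(Character.toChars(codePoint)));
    }
}
```

With this sketch, `fromName("u0BC6")` and `fromName("uni25CC")` yield a character, while `fromName("g129")` comes back empty. That empty result is exactly the gap that an additional glyph mapping has to fill.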

Thus, the text extracted for that first text drawing instruction is:

<01> - /u0BC6 - 0BC6 - ெ
<0C> - /u0BA4 - 0BA4 - த
<0D> - /u0BBE - 0BBE - ா
<07> - /u0B9F - 0B9F - ட
<14> - /g129 ?? 0014 - <DEVICE CONTROL FOUR>
<35> - /g118 ?? 0035 - 5
<02> - /u0B9A - 0B9A - ச
<0F> - /u0BBF - 0BBF - ி
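
The lookup behind this table can be replayed without PDFBox. The following sketch hard-codes the first 56 names of the Differences array quoted above (the array starts at code 1) and, for gXXX names, falls back to the character code itself, which is the behaviour the table shows. `DifferencesLookup` is a hypothetical class for illustration; standard names such as two are not resolved here.

```java
/** Illustrative sketch only: replays the code -> name -> code point lookup from the table. */
final class DifferencesLookup {
    static final int FIRST_CODE = 1; // the Differences array above starts with "1"

    // First seven rows (56 names) of the Differences array quoted above.
    static final String[] NAMES = (
            "u0BC6 u0B9A g125 u0BC8 u0BA9 g121 u0B9F u0BAE "
          + "u0BB1 g123 space u0BA4 u0BBE g148 u0BBF u0B8E "
          + "g122 u0BAA u0BAF g129 g130 g178 g127 u0B92 "
          + "g162 g116 u0B95 u0BC0 g158 u0BA8 u0BB2 colon "
          + "u0B85 g117 g173 g132 u0BB3 g182 g142 one "
          + "period g175 u0BB5 u0BB0 g126 u0B86 u0BC7 g186 "
          + "g156 g131 g143 two g118 g133 g190 hyphen").split(" ");

    static String nameFor(int code) {
        return NAMES[code - FIRST_CODE];
    }

    /** Code point the extraction ends up with for one character code. */
    static int codePointFor(int code) {
        String name = nameFor(code);
        if (name.matches("u[0-9A-F]{4,6}")) {
            return Integer.parseInt(name.substring(1), 16);
        }
        if (name.matches("g[0-9]+")) {
            return code; // unknown name: the code itself is used, as in the table above
        }
        // Standard names need the Adobe Glyph List, which this sketch does not include.
        throw new UnsupportedOperationException("standard name not handled: " + name);
    }

    public static void main(String[] args) {
        int[] codes = { 0x01, 0x0C, 0x0D, 0x07, 0x14, 0x35, 0x02, 0x0F };
        for (int code : codes) {
            System.out.printf("<%02X> - /%s - %04X%n", code, nameFor(code), codePointFor(code));
        }
    }
}
```

For code 0x14 this yields U+0014, a control character, and for 0x35 it yields U+0035, the digit 5, matching the two problematic rows.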

resulting in your first line of extracted text:

By the way, this corresponds to this section from the actual PDF:

A possible way to allow proper extraction

PDFBox provides mechanisms that let you add names to its map of known names. Therefore, if those gXXX names consistently represent the same characters across your documents, you can tweak PDFBox text extraction to meet your requirements.

The stable PDFBox version 1.8.X uses a different mechanism than the 2.0.0 release candidates. Thus:

For PDFBox 1.8.X you have to create a glyph list text file. For each glyph it contains one line with two semicolon-delimited fields, the glyph name and the Unicode scalar value, e.g.

A;0041
AE;00C6

Then you define a system property glyphlist_ext pointing towards that list, e.g. when starting your program

java -Dglyphlist_ext=/path/to/my/extra/glyphs ...
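
As a rough sketch of that file format, the helper below writes and reads name;hex lines. `ExtraGlyphList` is a hypothetical class for this illustration and not how PDFBox itself loads the file. It assumes one code point per line and skips lines starting with #, as in Adobe's own glyph list. The real values for names like g129 must come from inspecting the glyphs in your documents, so the example reuses only the two entries shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: formats and parses "name;hex" glyph list lines. */
final class ExtraGlyphList {
    /** One glyph list line, e.g. formatLine("A", 0x41) gives "A;0041". */
    static String formatLine(String glyphName, int codePoint) {
        return String.format("%s;%04X", glyphName, codePoint);
    }

    /** Maps each glyph name to the character it stands for. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line or comment
            }
            String[] fields = trimmed.split(";");
            if (fields.length != 2) {
                throw new IllegalArgumentException("expected name;hex but got: " + line);
            }
            // Assumption: a single code point per line.
            int codePoint = Integer.parseInt(fields[1].trim(), 16);
            mapping.put(fields[0].trim(), new String(Character.toChars(codePoint)));
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("A;0041", "AE;00C6")).keySet()); // prints [A, AE]
    }
}
```

Running the entries of such a file through a check like this before handing it to PDFBox can catch malformed lines early.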

For PDFBox 2.0.0 this mechanism has been replaced and moved multiple times; I have no idea which variant is the current one.

While working on PDFBOX-2379, the developers introduced an exception that is thrown if the above-mentioned system property is found:

throw new UnsupportedOperationException("glyphlist_ext is no longer supported, "
    + "use GlyphList.DEFAULT.addGlyphs(Properties) instead");

Unfortunately GlyphList does not have that method addGlyphs anymore.

While working on PDFBOX-2380 it has been removed and replaced:

I've replaced the static DEFAULT glyph list with a getAdobeGlyphList() method, as some PDFBox font internals require this to be the AGL and not some other additional glyph list. The loading and use of the additional glyphlist is application specific and so has been moved to PDFStreamEngine, where the getGlyphList() method can be overridden to pass custom glyph lists to fonts.

Unfortunately, PDFStreamEngine does not have that getGlyphList method anymore.

And I'm not currently in a mood to continue hunting high and low to find that feature again. Arg.

The making-of

In a comment the OP asked how I retrieved the information above from the PDF file in question.

First of all, I used a PDF internals browsing application, e.g. iText RUPS or PDFBox PDFDebugger, to inspect the PDF, and the PDF specification ISO 32000-1 to understand what I was inspecting.

The OP in particular pointed to page 3 of his document, so I looked for the first text showing operators (cf. ISO 32000-1 section 9.4.3) in the content stream (cf. ISO 32000-1 section 7.8.2) of that page (cf. ISO 32000-1 section 7.7.3.3):

These are almost the instructions I quoted above. As you can see, though, the character strings unfortunately cannot be inspected here because their contents are mostly in the Unicode control character range. Thus, I saved the content stream (right-click, context menu) and inspected a hex view of those instructions.

Using this information I created the instruction quote above.
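
The conversion from raw string bytes to the hexadecimal notation used in the quote is mechanical. A minimal sketch follows, with a hypothetical class name, assuming the bytes have already been taken out of the literal string (that is, escape sequences are resolved):

```java
/** Hypothetical helper: renders raw string bytes in PDF hex string notation. */
final class PdfHex {
    /** Writes each byte as its own hex string, e.g. {0x01, 0x0C} gives "<01><0C>". */
    static String toHexStrings(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append(String.format("<%02X>", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { 0x01, 0x0C, 0x0D, 0x07, 0x14, 0x35, 0x02, 0x0F };
        System.out.println(toHexStrings(raw)); // prints <01><0C><0D><07><14><35><02><0F>
    }
}
```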

The font (cf. ISO 32000-1 section 9.6) selected in those instructions is R9 (cf. ISO 32000-1 section 9.3.1), so I continued by looking at the font resource (cf. ISO 32000-1 section 7.8.3) with that name on page 3, first unsuccessfully searching for a ToUnicode entry (cf. ISO 32000-1 section 9.10.3), then succeeding in finding an Encoding (cf. ISO 32000-1 section 9.6.6):

(Screenshot: RUPS showing the Encoding of font R9 on page 3, https://i.stack.imgur.com/99Mbf.png)

This I copied and prettied up somewhat to get the encoding quote above.

From this information I manually created the table with the glyph ids (from the text-showing operation in the instruction block), the corresponding names (from the encoding differences), the assumed Unicode code points (derived from the uXXXX names; for gXXX names, the glyph id again), and the characters (from one of the many Unicode table sites on the internet).

To find the corresponding section of the actual PDF page, I took the final two arguments of the Tm text matrix setting operation (cf. ISO 32000-1 section 9.4.2), taking the aggregated transformation matrix changes (cf. ISO 32000-1 section 8.4) into account. These are the coordinates of the start of the baseline of the text drawn by the following text-showing instruction.
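
That calculation can be sketched as follows. PDF matrices are written as six numbers a b c d e f, and the text-space origin ends up at the e and f components of the text matrix multiplied by the current transformation matrix (CTM). `TextOrigin` is a hypothetical helper, and the identity CTM used in `main` is an assumption for illustration; the actual CTM depends on the cm operators that precede the text in the content stream.

```java
/** Hypothetical helper: PDF matrices are stored as {a, b, c, d, e, f}. */
final class TextOrigin {
    /** Returns m1 x m2: m1 is applied first, then m2 (PDF row-vector convention). */
    static double[] multiply(double[] m1, double[] m2) {
        return new double[] {
            m1[0] * m2[0] + m1[1] * m2[2],
            m1[0] * m2[1] + m1[1] * m2[3],
            m1[2] * m2[0] + m1[3] * m2[2],
            m1[2] * m2[1] + m1[3] * m2[3],
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
        };
    }

    /** Where the text-space origin (0, 0) lands after applying Tm and then the CTM. */
    static double[] baselineStart(double[] tm, double[] ctm) {
        double[] combined = multiply(tm, ctm);
        return new double[] { combined[4], combined[5] };
    }

    public static void main(String[] args) {
        double[] tm = { 0.999418, 0, 0, 1, 519.6, 791.721 }; // operands of the Tm operator above
        double[] ctm = { 1, 0, 0, 1, 0, 0 };                  // assumption: identity CTM
        double[] start = baselineStart(tm, ctm);
        System.out.println(start[0] + " " + start[1]);
    }
}
```

With an identity CTM the result is simply the last two Tm operands; any scaling or translation set up earlier in the page content shifts the point accordingly.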
