iText reads 1s in a PDF as up arrows

Problem description

For some reason iTextSharp is now reading a PDF that contains numbers such as 4123 as 4*23, where the * is actually an arrow pointing up. I'm not sure why this is happening. Please help.

Thanks

A sample file is available here: https://dl.dropboxusercontent.com/u/116833/SAMPLE%20PDF.pdf

Recommended answer

The reason for the arrows is that the file actually tries to mislead text extractors that follow the guidelines of Section 9.10.2, "Mapping Character Codes to Unicode Values", of the PDF specification ISO 32000-1, while not confusing those that prefer ActualText marked-content sequence entries: the former method is led to believe the '3's are arrows, while the latter is told the '3's are threes.

Most likely this is done to prevent automated text extraction while still allowing manual copy and paste: Adobe Reader prefers the ActualText marked-content sequence entries (so manual extraction works fine), while many programmatic extractors prefer the former method.
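The difference between the two strategies can be sketched in Python. This is only an illustration under stated assumptions: `RUNS` is a simplified, hand-written model of one text line of this file (the digit '3' wrapped in a Span carrying ActualText), `TO_UNICODE` lists only the manipulated entry, and `extract` is a hypothetical helper, not an iText or iTextSharp API.

```python
# Simplified model of one text line of the sample file. Each run is
# (character codes shown by Tj, ActualText of the enclosing Span or None).
RUNS = [
    (b"2", None),
    (b"3", "3"),          # the run wrapped in /Span<</ActualText<FEFF0033>>>
    (b"412109 ", None),
]

# Only the manipulated ToUnicode entry is listed; every other code is
# passed through unchanged in this sketch.
TO_UNICODE = {0x33: "\u0018"}


def extract(runs, to_unicode, prefer_actual_text):
    """Hypothetical extractor that can follow either strategy."""
    parts = []
    for codes, actual_text in runs:
        if prefer_actual_text and actual_text is not None:
            # Strategy used by viewers that honour ActualText.
            parts.append(actual_text)
        else:
            # Strategy based on the font's ToUnicode mapping.
            parts.append("".join(to_unicode.get(c, chr(c)) for c in codes))
    return "".join(parts)


print(repr(extract(RUNS, TO_UNICODE, prefer_actual_text=False)))  # '2\x18412109 '
print(repr(extract(RUNS, TO_UNICODE, prefer_actual_text=True)))   # '23412109 '
```

The first call reproduces the corrupted output, the second the text a human sees when copying from a viewer that honours ActualText.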

As far as I can tell from the relevant sections of the specification, it prefers neither method over the other.

For example, look at the first part number:

BT
/T1_1 1 Tf
10 0 0 10 69.1456 750.2834 Tm
(1 )Tj
ET
EMC 
/Span <</MCID 14 >>BDC 
BT
/T1_1 1 Tf
10 0 0 10 89.5488 750.2834 Tm
(2)Tj
/Span<</ActualText<FEFF0033>>> BDC 
(3)Tj
EMC 
(412109 )Tj
ET
EMC 

As you can see, the '3' is marked with an ActualText entry indicating that it is indeed a three (<FEFF0033> is a verbose way of writing the Unicode digit three: the byte order mark FEFF followed by the UTF-16BE code unit 0033).
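To make that encoding concrete, here is a minimal Python sketch of how an ActualText value such as <FEFF0033> could be decoded. The helper name `decode_pdf_text_string` is hypothetical, and the branch without a byte order mark uses Latin-1 as a stand-in for PDFDocEncoding, which is only an approximation.

```python
def decode_pdf_text_string(hex_body):
    """Decode the hex digits of a PDF text string such as <FEFF0033>.

    Hypothetical helper for illustration; not part of iText or iTextSharp.
    """
    raw = bytes.fromhex(hex_body)
    if raw.startswith(b"\xfe\xff"):
        # A leading FEFF byte order mark means the rest is UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding here.
    return raw.decode("latin-1")


print(repr(decode_pdf_text_string("FEFF0033")))  # '3'
```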

The font T1_1, on the other hand, offers a ToUnicode stream containing the mapping

...
<30> <0030>
<31> <0031>
<32> <0032>
<33> <0018>
<34> <0034>
<35> <0035>
...



As you can see, the other digits (0x30 is '0', 0x31 is '1', ..., 0x39 is '9') are mapped to themselves, but the '3', i.e. 0x33, is mapped to the Unicode code point 0x0018.
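The effect of such a mapping can be reproduced with a few lines of Python. This is a sketch, not how iText parses CMaps: `parse_bfchar` and `decode_with_to_unicode` are hypothetical helpers, the entries for <30> to <35> are the ones quoted above, and the entries for <36> to <39> are assumed from the statement that the other digits map to themselves.

```python
import re

# <30>..<35> are quoted from the ToUnicode stream; <36>..<39> are assumed.
BFCHAR_LINES = """
<30> <0030>
<31> <0031>
<32> <0032>
<33> <0018>
<34> <0034>
<35> <0035>
<36> <0036>
<37> <0037>
<38> <0038>
<39> <0039>
"""


def parse_bfchar(text):
    """Map each single-byte character code to its Unicode string."""
    mapping = {}
    pattern = r"<([0-9A-Fa-f]{2})>\s*<([0-9A-Fa-f]{4})>"
    for code, target in re.findall(pattern, text):
        mapping[int(code, 16)] = chr(int(target, 16))
    return mapping


def decode_with_to_unicode(codes, mapping):
    # Codes without an entry are passed through unchanged in this sketch;
    # a real extractor would fall back to the font's encoding instead.
    return "".join(mapping.get(c, chr(c)) for c in codes)


to_unicode = parse_bfchar(BFCHAR_LINES)
print(repr(decode_with_to_unicode(b"0123456789", to_unicode)))  # '012\x18456789'
```

Every digit survives except the '3', which comes out as the control character U+0018.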

U+0018 is the Unicode hex value of the character <control>, which is categorized as "control character" in the Unicode 6.0 character table.

"<control>" was previously named "CANCEL" in older versions of Unicode.

(cf. http://www.marathon-studios.com/unicode/U0018/Control)

In some contexts this control character is displayed as an upwards arrow; for example, IBM code page 437 assigns the glyph ↑ to position 0x18.
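A ToUnicode map can be checked for this kind of manipulation by looking for targets in the Unicode "control character" category. The following Python sketch uses the standard `unicodedata` module; `suspicious_targets` is a hypothetical helper, and `sample` is a hand-copied excerpt of the mapping quoted above rather than something read from the PDF.

```python
import unicodedata

ch = chr(0x0018)
print(unicodedata.category(ch))  # Cc
# Control characters have no formal name, so the fallback string is printed.
print(unicodedata.name(ch, "<control>"))  # <control>


def suspicious_targets(mapping):
    """Return entries whose target contains a control character (Cc)."""
    return {
        code: target
        for code, target in mapping.items()
        if any(unicodedata.category(c) == "Cc" for c in target)
    }


# Excerpt of the font's mapping as quoted in this answer.
sample = {0x32: "\u0032", 0x33: "\u0018", 0x34: "\u0034"}
print(suspicious_targets(sample))  # {51: '\x18'}
```

A digit code that maps to a control character is a strong hint that the ToUnicode map should not be trusted for that font, and that ActualText entries, where present, are the better source.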
