Not able to copy special characters from PDF


Problem description

I need to copy some text, including special characters, from a PDF file, but a special character such as the dash (-) gets converted into 2.

Please find the attachment at the link below:

http://www.fileconvoy.com/dfl.php?id=g6a3426746a10af3b9992384375c5923396bce3660

The attachment contains the source PDF file from which I have to copy the data, and a screenshot image. I need urgent help. I have also tried copying the data from the PDF using Google Docs and Adobe Pro, but I get a similar result every time.

Solution

In a nutshell:

All information in your PDF indicates that the glyphs you see as dashes actually represent a two. Thus, to have those glyphs interpreted differently, you have to either change the value-to-Unicode mapping for that character in its font in your PDF, or resort to optical character recognition.

In detail:

Let's look at the part of the content stream of your PDF pg_0001.pdf from which the words you marked are created:

0 -1.1065 TD
[(Fibroblast)-241.2(growth)-234.1(factor-21)-237.3(\(FGF-21\))-242.3(activity)-233.9(in)-237(High-fat)-237.9(diet)-234.9(\(HFD\))-238.3(fed)-234(ApoE)]TJ
/F6 1 Tf
6.7246 0 0 5.9768 357.3354 542.4944 Tm
(2)Tj
/F4 1 Tf
.8346 0 TD
(/)Tj
/F6 1 Tf
.3372 0 TD
(2)Tj
/F4 1 Tf
8.9663 0 0 8.9663 372.9826 538.5259 Tm
[(mice)-235.6(with)-233.5(adiponectin)-240.8(\(Acrp30\))-237.6(knockdown.)]TJ

Each of your special characters here is indeed represented by the character '2' (= 50 = 0x32) from the font /F6.
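To make this easier to check, here is a minimal Python sketch that scans an excerpt of the content stream above and reports which font is active at each Tj operator. The regular expression and the helper name shown_strings are illustrative assumptions; the sketch only understands the simple "/Name size Tf" and "(string)Tj" forms seen in this excerpt and is not a general content stream parser.

```python
import re

# Excerpt of the content stream quoted above (TJ arrays shortened for brevity).
CONTENT = rb"""
[(fed)-234(ApoE)]TJ
/F6 1 Tf
6.7246 0 0 5.9768 357.3354 542.4944 Tm
(2)Tj
/F4 1 Tf
.8346 0 TD
(/)Tj
/F6 1 Tf
.3372 0 TD
(2)Tj
/F4 1 Tf
8.9663 0 0 8.9663 372.9826 538.5259 Tm
[(mice)-235.6(with)]TJ
"""

# Either a font selection "/Name size Tf" or a simple string shown with Tj.
TOKEN = re.compile(rb"/(\w+)\s+[\d.]+\s+Tf|\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(content):
    """Return (font name, raw string bytes) for every Tj in the excerpt."""
    font = None  # unknown until the first Tf operator is seen
    result = []
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            font = match.group(1).decode("ascii")
        else:
            result.append((font, match.group(2)))
    return result


pairs = shown_strings(CONTENT)
for font, raw in pairs:
    # Print the font together with the byte values of the shown string.
    print(font, raw, [hex(byte) for byte in raw])
```

Both strings drawn with /F6 consist of the single byte 0x32, which is the character '2'.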

The mapping from a character in the string to the glyph actually printed may be quite arbitrary, though, and the font may contain hints for the correct interpretation, so we should look at the definition of the font /F6 on that page:

<<
  /FirstChar 44
  /ToUnicode 21 0 R
  /Encoding 22 0 R
  /FontDescriptor 23 0 R
  /BaseFont /KAHBDA+AdvP7DA6
  /Subtype /Type1
  /LastChar 50
  /Type /Font
  /Widths [833 0 0 0 0 0 833]
>> 

So your font comes with a /ToUnicode mapping, which text extraction programs should use to interpret the characters in the content stream. Let's look at that mapping:

/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <2c> <32> endcodespacerange
2 beginbfchar
<2c> <002C>
<32> <0032>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

Thus the '2' = 0x32 here is mapped to <0032>, representing the Unicode code point U+0032, which once again is '2'.
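As a rough illustration of how a text extractor applies such a mapping, the following Python sketch parses the bfchar entries of the CMap above and looks up the code 0x32. The function name parse_bfchar is hypothetical, and the sketch assumes that only bfchar sections are present (bfrange sections, which other CMaps may use, are ignored).

```python
import re

# The relevant part of the /ToUnicode CMap quoted above.
CMAP = """
1 begincodespacerange <2c> <32> endcodespacerange
2 beginbfchar
<2c> <002C>
<32> <0032>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from all bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for source, target in pairs:
            # The source is the code used in the content stream;
            # the target is the replacement text encoded as UTF-16BE.
            mapping[int(source, 16)] = bytes.fromhex(target).decode("utf-16-be")
    return mapping


to_unicode = parse_bfchar(CMAP)
# Look up every byte of the string (2) that is shown with /F6.
extracted = "".join(to_unicode[code] for code in b"2")
print(to_unicode, repr(extracted))
```

Under these assumptions the lookup yields the digit '2' for the code 0x32, which is exactly what copy and paste produces.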

If the /ToUnicode mapping were not present, a text extraction program could instead use the /Encoding definition in the PDF object 22 0. But here again:

22 0 obj 
<<
  /Type /Encoding
  /Differences [44 /comma 50 /two]
>> 

Here the '2' = 50 is mapped to the glyph named /two which once again makes that glyph a two.
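The fallback through /Encoding can be sketched in a similar way. In a /Differences array, an integer sets the current code and each following name is assigned to consecutive codes. The helper parse_differences and the tiny glyph name table below are illustrative assumptions; a real extractor would use a complete glyph list rather than this hand-written subset.

```python
def parse_differences(items):
    """Expand a /Differences array into {character code: glyph name}."""
    names = {}
    code = None
    for item in items:
        if isinstance(item, int):
            code = item  # an integer resets the running code
        else:
            if code is None:
                raise ValueError("glyph name before any character code")
            names[code] = item
            code += 1  # consecutive names get consecutive codes
    return names


# /Differences [44 /comma 50 /two] from object 22 0 above.
DIFFERENCES = [44, "comma", 50, "two"]

# Hand-written subset of a glyph name to text table, for illustration only.
GLYPH_TO_TEXT = {"comma": ",", "two": "2", "hyphen": "-", "minus": "\u2212"}

glyph_names = parse_differences(DIFFERENCES)
print(glyph_names[0x32], repr(GLYPH_TO_TEXT[glyph_names[0x32]]))
```

Had the font been built with a name such as /hyphen or /minus at code 50, this route would have produced a dash; with /two it produces '2'.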

Thus, all information in your PDF short of the glyph drawing definition itself (which could theoretically be checked by OCR'ing) indicates that the dash glyph is indeed a two.

To make a text extraction program interpret that glyph more to your liking, you should replace the /ToUnicode mapping of <32> with e.g. <002D>. Unfortunately that mapping is stored compressed (with the filter /FlateDecode), so this is no easy hex editor job but instead requires decoding the stream, editing it, and encoding it again.
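The decode, edit, and re-encode step can be sketched with Python's standard zlib module, since /FlateDecode data is zlib-compressed. The sketch below works on an in-memory stand-in for the stream bytes; the function name patch_tounicode is hypothetical, and locating the stream of object 21 0 in the file and writing it back are deliberately left out.

```python
import zlib

# Stand-in for the decoded /ToUnicode CMap (only the bfchar part is shown).
ORIGINAL_CMAP = b"2 beginbfchar\n<2c> <002C>\n<32> <0032>\nendbfchar\n"

# Stand-in for the stream bytes as stored in the PDF with /FlateDecode.
stored = zlib.compress(ORIGINAL_CMAP)


def patch_tounicode(stream_bytes):
    """Decode the stream, remap code 0x32 to U+002D, and encode it again."""
    cmap = zlib.decompress(stream_bytes)
    old, new = b"<32> <0032>", b"<32> <002D>"
    if cmap.count(old) != 1:
        raise ValueError("expected exactly one <32> <0032> entry")
    return zlib.compress(cmap.replace(old, new))


patched = patch_tounicode(stored)
print(zlib.decompress(patched).decode("ascii"))
```

The recompressed stream will generally not have the same length as the original, so the stream's /Length entry and the byte offsets in the cross-reference table have to be updated as well. That is the reason a plain hex editor is not enough and a PDF-aware tool or library is the safer way to write the patched stream back.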
