Extract ToUnicode map from one PDF and use it in another


Question

I have a Unicode PDF document that is missing the ToUnicode map. I have a different PDF that uses the same font and does have the ToUnicode map. Can I extract the map from one PDF and use it to extract text from the other?

Recommended answer

For Unicode mapping, Adobe defines a special resource, /ToUnicode. You can find it in the PDF file inside the font resource description. It looks like this:

<</BaseFont /ONWALI+Sylfaen/DescendantFonts [10 0 R]/Encoding /Identity-H/Subtype /Type0/ToUnicode 11 0 R/Type /Font>>

The /ToUnicode 11 0 R entry is what the PDF file needs to contain. 11 0 is the resource ID, that is, the object number and generation number of the object that holds the map.
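To check whether a font dictionary carries the entry, you can look for the reference with a regular expression. This is only a sketch: it assumes you already have the dictionary as text (as in the dictionary shown above), and the function name find_tounicode_ref is made up for illustration. A real PDF engine would resolve the reference for you.

```python
import re

# Font dictionary text copied from the example above.
FONT_DICT = (
    "<</BaseFont /ONWALI+Sylfaen/DescendantFonts [10 0 R]"
    "/Encoding /Identity-H/Subtype /Type0/ToUnicode 11 0 R/Type /Font>>"
)


def find_tounicode_ref(font_dict):
    """Return (object_number, generation) of the /ToUnicode reference, or None."""
    match = re.search(r"/ToUnicode\s+(\d+)\s+(\d+)\s+R", font_dict)
    if match is None:
        return None  # the font has no ToUnicode map
    return int(match.group(1)), int(match.group(2))


print(find_tounicode_ref(FONT_DICT))
```

A font dictionary for which this returns None is one that needs the map added.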

I created a sample PDF in Acrobat Pro containing all the alphabet symbols, using the same font as the report, so that it would have a standard ToUnicode mapping. I extracted the resource as text; it looks something like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
50 beginbfchar
<0003> <0020>
...and so on...
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
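Once you have the CMap as text, turning it into a lookup table is straightforward. The sketch below is not a full CMap parser: it only reads beginbfchar/endbfchar sections (beginbfrange sections are ignored) and it assumes 2-byte codes, as the codespace range above suggests. The second sample entry, <0024> <0041>, is invented for illustration, and the function names are hypothetical.

```python
import re

# Shortened CMap; the <0024> <0041> entry is an invented sample.
SAMPLE_CMAP = """\
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map each character code to its Unicode string (bfchar sections only)."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for code, uni in pairs:
            # The destination value is UTF-16BE encoded.
            mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping


def decode_hex_string(hex_string, mapping):
    """Decode a hex string of 2-byte codes; unknown codes become U+FFFD."""
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    return "".join(mapping.get(code, "\ufffd") for code in codes)


table = parse_bfchar(SAMPLE_CMAP)
print(decode_hex_string("002400030024", table))
```

This only gives correct text for the other PDF if both files use the same character codes for the font, which is the assumption behind the whole approach.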

The ToUnicode resource is usually compressed, so you have to decompress it to get text like the above.
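As a rough illustration of the decompression step, the sketch below assumes the stream uses the common /FlateDecode filter with no predictor, and that you have already cut out the raw bytes between the stream and endstream keywords. The function name decode_flate_stream is made up, and the sample bytes are produced here by compressing a tiny CMap fragment rather than taken from a real file.

```python
import zlib


def decode_flate_stream(raw):
    """Inflate a /FlateDecode stream body and return it as text."""
    # CMap text is plain ASCII, so latin-1 decodes it without loss.
    return zlib.decompress(raw).decode("latin-1")


# Stand-in for bytes read from a PDF: compress a small CMap fragment.
cmap_text = "1 beginbfchar\n<0003> <0020>\nendbfchar\n"
compressed = zlib.compress(cmap_text.encode("latin-1"))

print(decode_flate_stream(compressed))
```

If the stream dictionary lists a different filter, or a chain of filters, this will not work as written; a PDF engine normally handles that for you.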

Then I wrote code that takes the PDF (a report generated by Microsoft Reporting) and adds a /ToUnicode resource to each font it finds. A PDF has an xref table with pointers into the file, so you can't edit it as a text file. You therefore have to use a PDF engine (I used PDFTron, but iText should be enough). This post-processing code runs each time I need to save a report as PDF. Ideally the Microsoft Reporting engine would fill in the ToUnicode mapping itself, but that is too good to be true.

That's it.
