Why can a PDF document with embedded fonts be copied but not searched in a PDF reader?


Problem description

I am writing PDF files with embedded subset fonts. As required, I include the ToUnicode and CIDSet objects. To test, I created a simple PDF with two Hebrew characters. I can select the two characters, copy them to the clipboard, and paste them correctly into another application such as Word. But I cannot search for a word containing these two characters: Adobe Reader (or Acrobat) reports that the word was not found. In essence, I have created a PDF document that can be copied properly but is not searchable. Any idea what I might be missing when creating the document?

Additional information:

1. The file in question is a minimal file with just two characters. I have tested many such files in many different languages, including English. None of them is searchable.
2. Curiously, if I search for the letter 'e', Adobe Reader highlights an incorrect word, even though the letter 'e' does not exist in the file.
3. Adobe Acrobat cannot search within this file either. However, when I save the file to another file on disk, the saved copy is searchable. I confirmed that the major objects (the font file, the ToUnicode object, the CID object, and the font descriptor objects) are the same in the saved file. However, one of the font objects is moved closer to the top of the file.
4. Foxit is able to search these files properly.

Relevant PDF objects:

5 0 obj

<>

    q 0.750000 0 0 0.750000 0.000000 792.000000 cm 

    q q q 0.160000 0.000000 0.000000 0.160000 0.000000 0.000000 cm 

    BT /F0 100.000000 Tf 0 g 750.000000 -690 Td[<02B0>] TJ 35.000000 0 Td[<02B9>] TJ ET Q

    Q 

    Q

    Q

endstream

endobj

10 0 obj

<>

endobj

11 0 obj

<>/FontDescriptor 10 0 R/Subtype/CIDFontType2/Type/Font >>

endobj

12 0 obj

<>

endobj

8 0 obj

<>

    /CIDInit /ProcSet findresource begin

    12 dict begin

    begincmap

    /CIDSystemInfo

    << /Registry (Adobe)

    /Ordering (UCS) /Supplement 0 >> def

    /CMapName /Adobe-Identity-UCS def

    /CMapType 2 def

    1 begincodespacerange

    <0000> <FFFF>

    endcodespacerange

    3 beginbfchar

    <0000> <0000>

    <02B0> <05E0>

    <02B9> <05E9>

    endbfchar

    endcmap

    CMapName currentdict /CMap defineresource pop

    end

    end

endstream

endobj
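To make the ToUnicode mapping above concrete, here is a minimal Python sketch that reads the `beginbfchar` entries and decodes the two character codes used in the content stream. The function name `parse_bfchar` is made up for illustration, and the sketch assumes a CMap that contains only `bfchar` sections (no `bfrange`) with destinations encoded as UTF-16BE hex strings. It only shows why copying yields the correct Hebrew letters; it says nothing about how Adobe Reader builds its search index.

```python
import re


def parse_bfchar(cmap_text: str) -> dict:
    """Map character codes to Unicode strings from beginbfchar sections.

    Hypothetical helper: bfrange sections are ignored.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination strings are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# The bfchar section from object 8 above.
CMAP = """
3 beginbfchar
<0000> <0000>
<02B0> <05E0>
<02B9> <05E9>
endbfchar
"""

codes = parse_bfchar(CMAP)
# The content stream shows <02B0> and then <02B9>.
text = "".join(codes[c] for c in (0x02B0, 0x02B9))
print([f"U+{ord(ch):04X}" for ch in text])  # ['U+05E0', 'U+05E9']
```

U+05E0 and U+05E9 are the Hebrew letters nun and shin, which matches the observation that copy and paste works.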

Recommended answer

In short

The problem is caused by the same PDF ID being used for different documents.

Adobe Reader / Acrobat seem to cache search information for documents, identifying each document by its ID. Some of the OP's documents seem to have the same ID; at least the two sample files do:

/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>] 
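One way to check whether two generated files share an ID is to read the `/ID` entry from each file and compare the first (permanent) part. The sketch below is a rough check, not a PDF parser: the names `extract_pdf_ids` and `share_id` are hypothetical, and it assumes the ID is written as two hex strings in an uncompressed trailer, as in the line above. IDs stored in a compressed cross-reference stream or written as literal strings would not be found.

```python
import re

# Matches /ID[<hex><hex>] with optional whitespace between the tokens.
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]"
)


def extract_pdf_ids(data: bytes):
    """Return (permanent_id, changing_id) from the last /ID entry, or None."""
    matches = ID_PATTERN.findall(data)
    if not matches:
        return None
    # The last trailer in the file is the one currently in effect.
    first, second = matches[-1]
    return first.upper().decode("ascii"), second.upper().decode("ascii")


def share_id(a: bytes, b: bytes) -> bool:
    """True if both files carry an /ID and the permanent parts are equal."""
    ids_a, ids_b = extract_pdf_ids(a), extract_pdf_ids(b)
    return ids_a is not None and ids_b is not None and ids_a[0] == ids_b[0]


# Hypothetical trailer fragments standing in for the two sample files.
sample_a = b"trailer\n<</Size 13/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>]>>"
sample_b = b"trailer\n<</Size 14/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>]>>"

print(share_id(sample_a, sample_b))  # True
```

With real files, the same functions could be applied to the bytes returned by `open(path, "rb").read()`.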

Thus, when the OP tried to search his test.pdf, search information from earlier viewed PDFs with that ID was used. Consider this description from one of his comments:

What happens if you search for the English letter 'e'? For me, the two Hebrew letters get selected. The same happens when I search for one of these English letters: d, i, n, o, p, r, t, y, I, N, R, T and Y.

The search information seems to have been cached for a document with Latin glyphs. Furthermore, consider this comment on test_en.pdf (a document that shares the same ID):

It has one English line: 'This is a test line'. When I search for 'This', I find it. But I cannot find the other words.

The text of the originally cached document seems to have started with "This" but continued differently.
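If the shared ID is indeed the cause, the natural remedy is to give every generated file its own ID. The PDF specification recommends computing the identifier with a message digest such as MD5 over the current time, the file's location, its size, and the values in the document information dictionary. The following Python sketch follows that recommendation; the names `make_file_id` and `id_entry` and the sample paths and sizes are assumptions for illustration, not part of the OP's generator.

```python
import hashlib
import time


def make_file_id(path: str, size: int, info: dict) -> str:
    """Build a 32-hex-digit file identifier from per-file inputs.

    Hypothetical helper: hashes the current time, the file location,
    the file size, and the document information values.
    """
    digest = hashlib.md5()
    digest.update(str(time.time_ns()).encode("ascii"))
    digest.update(path.encode("utf-8"))
    digest.update(str(size).encode("ascii"))
    for key in sorted(info):
        digest.update(str(info[key]).encode("utf-8"))
    return digest.hexdigest().upper()


def id_entry(file_id: str) -> str:
    """Format the trailer entry; both parts are equal for a new file."""
    return f"/ID[<{file_id}><{file_id}>]"


# Two hypothetical output files now receive different identifiers.
id_he = make_file_id("test.pdf", 1234, {"Title": "Hebrew test"})
id_en = make_file_id("test_en.pdf", 1300, {"Title": "English test"})
print(id_entry(id_he))
print(id_entry(id_en))
```

The ID is used only as an identifier here, not for security, so MD5 is adequate; any scheme that yields a distinct value per file would avoid the collision described above.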
