PDFBox: embedding a subset font for annotations

Question

I am trying to use Apache PDFBox v2.0.21 to modify existing PDF documents, adding signatures and annotations. That means I am actively using incremental save mode. I am also embedding the LiberationSans font to accommodate some Unicode characters. It makes sense for me to use the subsetting feature for embedded fonts, because embedding LiberationSans in full makes the PDF file 200+ KB larger.

After much trial and error I finally got everything working except the font subsetting. I initialize the PDFont object once using

  try (InputStream fs = PDFService.class.getResourceAsStream("/static/fonts/LiberationSans-Regular.ttf")) {
     _font = PDType0Font.load(pddoc, fs, true);
  }

I then use a custom appearance stream to show the text.

   private void addAnnotation(String name, PDDocument doc, PDPage page, float x, float y, String text) throws IOException {
      
      List<PDAnnotation> annotations = page.getAnnotations();

      PDAnnotationRubberStamp t = new PDAnnotationRubberStamp();

      t.setAnnotationName(name); // might play important role
      t.setPrinted(true); // always visible
      t.setReadOnly(true); // does not interact with user
      t.setContents(text); 
      
      PDRectangle rect = ....;
      t.setRectangle(rect);

      PDAppearanceDictionary ap = new PDAppearanceDictionary();
      ap.setNormalAppearance(createAppearanceStream(doc, t));
      ap.getCOSObject().setNeedToBeUpdated(true);
      t.setAppearance(ap);
      
      annotations.add(t);
      page.setAnnotations(annotations);
      
      t.getCOSObject().setNeedToBeUpdated(true);
      page.getResources().getCOSObject().setNeedToBeUpdated(true);
      page.getCOSObject().setNeedToBeUpdated(true);
      doc.getDocumentCatalog().getPages().getCOSObject().setNeedToBeUpdated(true);
      doc.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);      
   }
   
   private PDAppearanceStream createAppearanceStream(final PDDocument document, PDAnnotation ann) throws IOException
   {
      PDAppearanceStream aps = new PDAppearanceStream(document);
      PDRectangle rect = ann.getRectangle();
      rect = new PDRectangle(0, 0, rect.getWidth(), rect.getHeight());
      aps.setBBox(rect); // set bounding box to the dimensions of the annotation itself
     
      // embed our unicode font (NB: yes, this needs to be done otherwise aps.getResources() == null which will cause NPE later during setFont)
      PDResources res = new PDResources();
      _fontName = res.add(_font).getName();
      aps.setResources(res);

      PDAppearanceContentStream apsContent = null;
      
      try {
         // draw directly on the XObject's content stream
         apsContent = new PDAppearanceContentStream(aps);

         apsContent.beginText();
         apsContent.setFont(_font, _fontSize);         
         apsContent.showText(ann.getContents());
         apsContent.endText();
      }
      finally {
         if (apsContent != null) {
            try { apsContent.close(); } catch (Exception ex) { log.error(ex.getMessage(), ex); }
         }
      }      

      aps.getResources().getCOSObject().setNeedToBeUpdated(true);
      aps.getCOSObject().setNeedToBeUpdated(true);
      return aps;
   }     

This code runs, but it creates a PDF with dots instead of the actual characters, which I guess means that the font subset has not been embedded. I also get the following warning:

2021-04-17 12:33:31.326 WARN 20820 --- [ main] o.a.p.pdmodel.PDAbstractContentStream : attempting to use subset font LiberationSans without proper context

After looking through the source code, I suspect that I am messing something up when creating the appearance stream: somehow it is not connected with the PDDocument, so the subsetting does not proceed normally. Note that the above code works well when the font is embedded in full (i.e. if I call PDType0Font.load with the last parameter set to false).

Can anyone give me a hint? Thank you!

Recommended answer

I don't know whether I just got lucky. Luck in programming very often points to something completely wrong or misleading. In any case, if someone can still give a hint, I am all ears.

Looking through the code again, I saw the following in PDDocument.save():

// subset designated fonts
for (PDFont font : fontsToSubset)
{
    font.subset();
}

This does not happen in PDDocument.saveIncremental(), which is what I am using. Just to experiment, I did the following right before calling saveIncremental() on my document:

 _font.subset(); // you can see in the beginning of the question how _font is created
 _font.getCOSObject().setNeedToBeUpdated(true);
 pddoc.saveIncremental(baos);

Believe it or not, the document was saved correctly; at least it looks correct in Acrobat Reader DC and in the Chrome and Firefox PDF viewers. Note that Unicode code points are added to the font's subset during showText() on the appearance content stream.
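Beyond looking at the result in a viewer, the /BaseFont names that the annotation appearance streams reference can be read back with PDFBox. The following is a minimal sketch, not part of the original code: the class and method names are made up for illustration, and it only looks at the font resources of each annotation's normal appearance stream (it does not descend into nested XObjects).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAppearanceStream;

/** Hypothetical helper: collects /BaseFont names used by annotation appearances. */
public final class AnnotationFontNames {

    private AnnotationFontNames() {
    }

    public static List<String> list(PDDocument doc) throws IOException {
        List<String> names = new ArrayList<>();
        for (PDPage page : doc.getPages()) {
            for (PDAnnotation ann : page.getAnnotations()) {
                PDAppearanceStream aps = ann.getNormalAppearanceStream();
                if (aps == null) {
                    continue; // annotation without a normal appearance
                }
                PDResources res = aps.getResources();
                if (res == null) {
                    continue; // appearance stream without resources
                }
                // Read the /Font sub-dictionary directly at the COS level,
                // so that no font program has to be parsed.
                COSBase fonts = res.getCOSObject().getDictionaryObject(COSName.FONT);
                if (!(fonts instanceof COSDictionary)) {
                    continue;
                }
                COSDictionary fontMap = (COSDictionary) fonts;
                for (COSName key : fontMap.keySet()) {
                    COSBase font = fontMap.getDictionaryObject(key);
                    if (font instanceof COSDictionary) {
                        String baseFont = ((COSDictionary) font).getNameAsString(COSName.BASE_FONT);
                        if (baseFont != null) {
                            names.add(baseFont);
                        }
                    }
                }
            }
        }
        return names;
    }
}
```

A subset font should show up here with a six-letter tag and a plus sign in front of the font name; a bare name suggests that subset() never ran for that font.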

UPDATE 18/04/2021: as I mentioned in the comments, users reported that they started seeing messages like "Cannot extract the embedded font XXXXXX+LiberationSans-Regular from ..." when they opened the modified PDF files. Strangely enough, I did not see these messages during my tests. It turns out that my copy of Acrobat Reader DC was newer than theirs: continuous release version 2021.001.20149 showed no errors, while continuous release version 2020.012.20043 showed the above message.

After investigating, it turned out that the problem was with the way I was embedding the font. I am not aware of any other way, and I am not familiar enough with the PDF specification to know otherwise. As you can see from the code above, I loaded the font ONCE for the document and then used it freely in the resource dictionary of the appearance stream of EVERY annotation. As a result, all the resource dictionaries of the annotation content streams referenced an F1 font defined with the SAME /BaseFont name. The PDF Reference, 3rd ed., p. 323, specifically states:

"... the PostScript name of the font - ... - begins with a tag followed by a plus sign (+). The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags..."

Once I started calling PDType0Font.load for each of my annotations, and calling subset() (and of course setNeedToBeUpdated) after creating the appearance stream for each of them, the /BaseFont names did indeed start to differ, and the older 2020 version of Acrobat Reader DC stopped complaining.
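The per-annotation approach described above could look roughly like the sketch below. It is an illustration under assumptions, not the exact code I ended up with: the class name, the method name and the idea of passing the font as a byte array (so that it can be parsed again for every annotation) are made up for this example. The PDFBox calls are the same ones used earlier in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDAppearanceContentStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAppearanceStream;

/** Hypothetical helper: one freshly loaded subset font per annotation appearance. */
public final class PerAnnotationSubsetFont {

    private PerAnnotationSubsetFont() {
    }

    public static PDAppearanceStream createAppearance(PDDocument doc, byte[] fontBytes,
            String text, PDRectangle annotationRect, float fontSize) throws IOException {
        // Load a NEW font object for this annotation, with subsetting enabled.
        PDType0Font font;
        try (InputStream in = new ByteArrayInputStream(fontBytes)) {
            font = PDType0Font.load(doc, in, true);
        }

        PDAppearanceStream aps = new PDAppearanceStream(doc);
        aps.setBBox(new PDRectangle(0, 0, annotationRect.getWidth(), annotationRect.getHeight()));

        PDResources res = new PDResources();
        res.add(font);
        aps.setResources(res);

        // showText() records the code points that the subset must contain.
        // setFont() may still log the "without proper context" warning here.
        try (PDAppearanceContentStream cs = new PDAppearanceContentStream(aps)) {
            cs.beginText();
            cs.setFont(font, fontSize);
            cs.showText(text);
            cs.endText();
        }

        // saveIncremental() does not subset fonts, so do it explicitly,
        // after all text for this font has been shown.
        font.subset();

        // Mark the new objects so that the incremental save writes them.
        font.getCOSObject().setNeedToBeUpdated(true);
        res.getCOSObject().setNeedToBeUpdated(true);
        aps.getCOSObject().setNeedToBeUpdated(true);
        return aps;
    }
}
```

The trade-off is that every annotation now carries its own font subset, so a document with many annotations grows more than it would with a single shared font.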

Note that besides using iText RUPS to inspect the PDF contents, one can use the Foxit PDF viewer to at least make sure that the subset font names are different. Acrobat Reader DC and PDF-XChange (under Properties -> Fonts) only show the base font name, like LiberationSans, without the unique 6-letter prefix.
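Once the font names have been read out with one of these tools, the tag rule quoted from the PDF Reference can also be checked mechanically. The sketch below is plain Java with no PDFBox dependency; the class and method names are made up for illustration. A repeated name is only a hint worth a closer look, since the same font object may legitimately be referenced from several places.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper: checks /BaseFont names against the subset tag rule. */
public final class SubsetTagCheck {

    // Exactly six uppercase letters, a plus sign, then the font name.
    private static final Pattern SUBSET_NAME = Pattern.compile("[A-Z]{6}\\+.+");

    private SubsetTagCheck() {
    }

    /** Returns true if the name starts with a well-formed subset tag. */
    public static boolean hasSubsetTag(String baseFont) {
        return baseFont != null && SUBSET_NAME.matcher(baseFont).matches();
    }

    /** Returns the tagged names that occur more than once, in first-repeat order. */
    public static List<String> repeatedNames(List<String> baseFontNames) {
        Set<String> seen = new HashSet<>();
        List<String> repeated = new ArrayList<>();
        for (String name : baseFontNames) {
            // seen.add() returns false when the name was already present.
            if (hasSubsetTag(name) && !seen.add(name) && !repeated.contains(name)) {
                repeated.add(name);
            }
        }
        return repeated;
    }
}
```

Names without a tag are ignored by repeatedNames(), because a fully embedded or non-embedded font is not a subset and the rule does not apply to it.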

UPDATE 19/04/2021: I am still working on this issue, because I still get reports about the infamous "Cannot extract the embedded font" message. It is quite possible that the original cause of that message was not (or not only) the fact that the different subsets had the same BaseFont name. One thing I have observed is that on some computers the stamp annotations I am using cause Acrobat Reader DC to automatically open the so-called "Comments pane". There is an option to turn this off (Preferences -> Commenting -> Show comments pane when a PDF with comments is opened). When this pane opens, either manually or automatically, the error message appears (I was at my wits' end trying to work out why the same version of Acrobat Reader DC behaves differently on different machines). I think that Acrobat Reader tries to extract the full version of the font and fails, since only a subset is embedded. I guess this has nothing to do with the semantic contents of the document, though: the document still passes "qpdf --check". I am currently trying to find out whether stamps can be restricted so that they do not allow comments, i.e. some way to disable the comments pane in Acrobat Reader DC, although I have little hope.

UPDATE 20/04/2021: I have opened a new question for this.
