How to fix PDF/A metadata set by PDFBox (working with Docx4j and XDocReport)


Question

To reach the PDF/A-1A conformance level, I am setting XMP metadata on a PDF using PDFBox v2.0.13. Before setting the metadata, I convert the file from .docx to PDF. I have tried two ways of doing the conversion: one using XDocReport v2.0.1 and the other using Docx4j v6.1.0.

The Java class contains the following code:

PDDocumentInformation info = pdf.getDocumentInformation();
info.setTitle("Apache PDFBox");
info.setSubject("Apache PDFBox adding meta-data to PDF document");
info.setCreator("MyCreator");
...
DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
dcSchema.setTitle(info.getTitle());
dcSchema.setDescription(info.getSubject());
dcSchema.addCreator(info.getCreator());

When I convert with XDocReport, I get the following metadata:

  </rdf:Description>
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Apache PDFBox</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>MyCreator</rdf:li>
        </rdf:Seq>
      </dc:creator>
   </rdf:Description>

When I convert with Docx4j instead, I get the following metadata:

    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li lang="x-default">Apache PDFBox</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:description>
        <rdf:Alt>
          <rdf:li lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>MyCreator</rdf:li>
        </rdf:Seq>
      </dc:creator>
    </rdf:Description>

Because of the difference in the metadata produced for "title" and "description" (xml:lang="x-default" in the XDocReport case, lang="x-default" in the Docx4j case), the final PDF produced using XDocReport validates as PDF/A-1A, while the one produced using Docx4j does not.
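The difference between the two outputs can also be detected programmatically. The following is only a sketch, assuming the XMP packet is already available as a string; the class name XmpLangCheck and the method countAltItemsWithoutXmlLang are hypothetical names chosen for illustration. It parses the XMP with a namespace-aware DOM parser and counts the rdf:li elements inside rdf:Alt that have no attribute named lang in the XML namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox or xmpbox.
public class XmpLangCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Counts rdf:li elements inside rdf:Alt that lack a namespaced xml:lang attribute. */
    static int countAltItemsWithoutXmlLang(String xmp) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required so that xml:lang and plain lang are distinguishable
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        int missing = 0;
        NodeList alts = doc.getElementsByTagNameNS(RDF_NS, "Alt");
        for (int i = 0; i < alts.getLength(); i++) {
            NodeList items = ((Element) alts.item(i)).getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++) {
                Element li = (Element) items.item(j);
                // A bare lang="..." attribute has no namespace and does not match here.
                if (!li.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        String template = "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\" rdf:about=\"\">"
                + "<dc:title><rdf:Alt><rdf:li %s=\"x-default\">Apache PDFBox</rdf:li></rdf:Alt></dc:title>"
                + "</rdf:Description></rdf:RDF>";
        System.out.println(countAltItemsWithoutXmlLang(String.format(template, "xml:lang")));
        System.out.println(countAltItemsWithoutXmlLang(String.format(template, "lang")));
    }
}
```

A non-zero count points at the same defect that shows up in the Docx4j output above.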

The check is performed with VeraPDF.

Since Docx4j produces a more readable PDF, is there a way to fix the metadata in the final PDF?

Recommended answer

This is a known problem when xmpbox is used together with certain other libraries, e.g. FOP.

The problem is the transformer. This code in XmpSerializer.java:

Transformer transformer = TransformerFactory.newInstance().newTransformer();

should return an instance of com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl. (Try it.)
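One way to try it is to print the class names that are actually resolved at runtime. This is a small sketch; the class name TransformerCheck and its methods are hypothetical. It should be run with the same classpath as the conversion code, since the result depends on which jars are present.

```java
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic class: reports which JAXP implementation is picked up.
public class TransformerCheck {

    /** Name of the TransformerFactory implementation selected by the JAXP lookup. */
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Name of the Transformer implementation that the selected factory creates. */
    static String transformerClassName() throws Exception {
        return TransformerFactory.newInstance().newTransformer().getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("factory:     " + factoryClassName());
        System.out.println("transformer: " + transformerClassName());
    }
}
```

If the transformer line shows anything other than the JDK class named above, another library on the classpath has supplied its own implementation.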

The javadoc (https://docs.oracle.com/javase/7/docs/api/javax/xml/transform/TransformerFactory.html#newInstance()) states:

"The Services API will look for a classname in the file META-INF/services/javax.xml.transform.TransformerFactory in jars available to the runtime."

In other words, a jar on the classpath that registers its own TransformerFactory in this way can take precedence over the JDK's built-in one.

You can force the default implementation by setting a system property:

System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");

However, this might break something in the other library.

A different solution would be to copy the source code of XmpSerializer and change the newInstance call like this:

Transformer transformer = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer();
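A possible middle ground between the two approaches, shown here only as a sketch and not as something the answer itself proposes, is to set the system property just for the duration of the serialization and restore the previous value afterwards. The class name JdkTransformerScope and the method withJdkTransformerFactory are hypothetical. The approach assumes that nothing else creates transformers on another thread in the meantime, because system properties are global to the JVM.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: scopes the JAXP factory override to a single action.
public class JdkTransformerScope {
    static final String KEY = "javax.xml.transform.TransformerFactory";
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    /** Runs the action with the JDK factory forced, then restores the old setting. */
    static <T> T withJdkTransformerFactory(Callable<T> action) throws Exception {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, JDK_FACTORY);
        try {
            return action.call();
        } finally {
            // Restore whatever was configured before, so other libraries keep their factory.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }
}
```

The XmpSerializer call would then go inside the action passed to withJdkTransformerFactory, so that the TransformerFactory.newInstance() call it makes internally resolves to the JDK implementation.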
