How to heal inconsistent parent tree mappings in a PDF created by pdfBox


Problem description

    We are creating PDF documents in Java using pdfBox. Since they should be accessible to screen readers, we tag the content, set up a parent tree, and add it to the document catalog.

    Please find an example file here.

    When we check the resulting PDF with the PAC3 validator, we get 25 errors for inconsistent entries in the structural parent tree.

    The Adobe Preflight syntax error check gives the same result with more detail. The error message is

    Inconsistent ParentTree mapping (ParentTree element 0) for structure element 
    Traversal Path:->StructTreeRoot->K->K->[1]->K->[3]->K->[4]
    

    Adobe preflight syntax error check

    When I try to follow that traversal path in the pdfBox Debugger, I see an element referencing the ID 22.

    Now my questions are:

    1. What is the connection between the StructTreeRoot and the ParentTree?
    2. Where in the StructTreeRoot/ParentTree can I find the item with ID 22 that is referred to in node K->K->2->K->4->K->4? See image PDF Debugger
    3. What is that Parent Tree element 0 in the Preflight error message? See image Adobe preflight syntax error check

    PDF Debugger

    I think that both building accessible PDFs with pdfBox and the error messages from common validation tools are rather poorly documented. Where can I find more information about them?

    Thanks a lot for your help.

    Solution

    The issue in your PDF is very reminiscent of the one discussed in the last section, "Yet another issue with parent tree entries", of this answer to the question ""Find Tag from Selection" is not working in tagged pdf?" by fascinating coder:

    In your parent tree, you do not reference the actual parent structure element of the MCID. What you reference is a new structure tree node that claims the actual parent node from the structure hierarchy as its own parent (without actually being one of that node's kids) and also claims the MCID in question as its kid.

    Instead you should simply reference the actual parent structure element of the MCID.
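The rule above can be illustrated with a small model that does not use pdfBox at all. It is only a sketch under simplifying assumptions: `Elem` and `ParentTreeCheck` are hypothetical names, a structure element is reduced to a name plus a list of kids, and only a single page is considered, so the parent tree entry for that page is one array indexed by MCID. The check reports every MCID whose array entry is not the very element that lists that MCID among its kids, which is roughly the condition the validators complain about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for a structure element (hypothetical, not a pdfBox class). */
final class Elem {
    final String name;
    final List<Object> kids = new ArrayList<>(); // each kid is an Elem or an Integer MCID

    Elem(String name, Object... kids) {
        this.name = name;
        this.kids.addAll(Arrays.asList(kids));
    }
}

class ParentTreeCheck {
    /**
     * Walks the structure tree below node and compares it with the parent
     * tree array of a single page (index = MCID, value = claimed parent).
     */
    static List<String> findInconsistencies(Elem node, List<Elem> parentArray) {
        List<String> problems = new ArrayList<>();
        for (Object kid : node.kids) {
            if (kid instanceof Elem) {
                problems.addAll(findInconsistencies((Elem) kid, parentArray));
            } else if (kid instanceof Integer) {
                int mcid = (Integer) kid;
                Elem mapped = mcid < parentArray.size() ? parentArray.get(mcid) : null;
                // Consistent only if the array points back at the same object
                if (mapped != node) {
                    problems.add("MCID " + mcid + ": expected " + node.name
                            + ", found " + (mapped == null ? "nothing" : mapped.name));
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Elem h1 = new Elem("H1", 1);
        Elem p = new Elem("P", 0);
        Elem doc = new Elem("Document", h1, p);

        // A fresh node that claims MCID 0 but is not part of the hierarchy
        Elem stray = new Elem("P-copy", 0);

        System.out.println(findInconsistencies(doc, Arrays.asList(stray, h1)));
        System.out.println(findInconsistencies(doc, Arrays.asList(p, h1)));
    }
}
```

The first call reports MCID 0 because the array entry is a look-alike node rather than the real `P` element; the second call, where the array references the actual parents, reports nothing.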

    As your question title asks how to heal inconsistent parent tree mappings in a PDF created by pdfBox, here is an approach that fixes your parent tree by rebuilding it from the structure tree.

    First recursively collect MCIDs and their parent structure tree elements by page, e.g. using a method like this:

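    import java.util.HashMap;
    import java.util.Map;
    import org.apache.pdfbox.cos.*;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDNumberTreeNode;
    import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.*;
    
    // Walks the structure tree below node and records, per page, which node is the parent of each MCID.
    void collect(PDPage page, PDStructureNode node, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        // A Pg entry on the node overrides the page inherited from its ancestors
        COSDictionary pageDictionary = node.getCOSObject().getCOSDictionary(COSName.PG);
        if (pageDictionary != null) {
            page = new PDPage(pageDictionary);
        }
    
        for (Object object : node.getKids()) {
            if (object instanceof COSArray) {
                for (COSBase base : (COSArray) object) {
                    if (base instanceof COSDictionary) {
                        // Nested structure element: recurse
                        collect(page, PDStructureNode.create((COSDictionary) base), parentsByPage);
                    } else if (base instanceof COSNumber) {
                        // Plain number: an MCID whose parent is the current node
                        setParent(page, node, ((COSNumber)base).intValue(), parentsByPage);
                    } else {
                        // Kid type not handled by this sketch
                        System.out.printf("?%s\n", base);
                    }
                }
            } else if (object instanceof PDStructureNode) {
                collect(page, (PDStructureNode) object, parentsByPage);
            } else if (object instanceof Integer) {
                setParent(page, node, (Integer)object, parentsByPage);
            } else {
                // Kid type not handled by this sketch
                System.out.printf("?%s\n", object);
            }
        }
    }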
    

    (RebuildParentTreeFromStructure method)

    with this helper method

    // Registers node as the parent of the given MCID on the given page; the first parent found wins.
    void setParent(PDPage page, PDStructureNode node, int mcid, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        if (node == null) {
            System.err.printf("Cannot set null as parent of MCID %s.\n", mcid);
        } else if (page == null) {
            System.err.printf("Cannot set parent of MCID %s for null page.\n", mcid);
        } else {
            // Create the per-page map on first use
            Map<Integer, PDStructureNode> parents = parentsByPage.computeIfAbsent(page, key -> new HashMap<>());
            if (parents.containsKey(mcid)) {
                System.err.printf("MCID %s already has a parent. New parent rejected.\n", mcid);
            } else {
                parents.put(mcid, node);
            }
        }
    }
    

    (RebuildParentTreeFromStructure helper method)

    and then rebuild based on the collected information:

    // Replaces the parent tree of root with one built from the collected MCID-to-parent data.
    void rebuildParentTreeFromData(PDStructureTreeRoot root, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        int parentTreeMaxkey = -1;
        Map<Integer, COSArray> numbers = new HashMap<>();
    
        for (Map.Entry<PDPage, Map<Integer, PDStructureNode>> entry : parentsByPage.entrySet()) {
            // The page's StructParents value is its key in the parent tree; getInt returns -1 if it is missing
            int parentsId = entry.getKey().getCOSObject().getInt(COSName.STRUCT_PARENTS);
            if (parentsId < 0) {
                System.err.printf("Page without StructParents. Ignoring %s MCIDs.\n", entry.getValue().size());
            } else {
                if (parentTreeMaxkey < parentsId)
                    parentTreeMaxkey = parentsId;
                // One array per page: index = MCID, value = parent structure element
                COSArray array = new COSArray();
                for (Map.Entry<Integer, PDStructureNode> subEntry : entry.getValue().entrySet()) {
                    array.growToSize(subEntry.getKey() + 1);
                    array.set(subEntry.getKey(), subEntry.getValue());
                }
                numbers.put(parentsId, array);
            }
        }
    
        PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(PDParentTreeValue.class);
        numberTreeNode.setNumbers(numbers);
        root.setParentTree(numberTreeNode);
        // ParentTreeNextKey has to be greater than any key in use
        root.setParentTreeNextKey(parentTreeMaxkey + 1);
    }
    

    (RebuildParentTreeFromStructure method)
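To make the resulting layout more tangible, here is a pdfBox-free sketch of the data that the rebuild step produces. The class `ParentTreeLayout` and its methods are hypothetical names introduced only for this illustration, and structure elements are reduced to plain strings. It shows two details of the method above: the per-page array is indexed by MCID, with unused positions left empty, and the next key is one more than the largest key in use (0 for an empty tree).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ParentTreeLayout {
    /** Builds the array for one page: index = MCID, value = parent name, gaps stay null. */
    static List<String> toParentArray(Map<Integer, String> parentsByMcid) {
        List<String> array = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : parentsByMcid.entrySet()) {
            // Grow the list until the MCID is a valid index
            while (array.size() <= entry.getKey()) {
                array.add(null);
            }
            array.set(entry.getKey(), entry.getValue());
        }
        return array;
    }

    /** The next key is one more than the largest key in use, or 0 for an empty tree. */
    static int nextKey(Map<Integer, ?> parentTree) {
        return parentTree.isEmpty() ? 0 : Collections.max(parentTree.keySet()) + 1;
    }

    public static void main(String[] args) {
        // A page with StructParents 0 on which only MCIDs 0 and 2 are tagged
        Map<Integer, String> page0 = new TreeMap<>();
        page0.put(0, "H1");
        page0.put(2, "P");

        Map<Integer, List<String>> parentTree = new TreeMap<>();
        parentTree.put(0, toParentArray(page0));

        System.out.println(parentTree + " next key " + nextKey(parentTree));
    }
}
```

In this model the key 0 corresponds to the "ParentTree element 0" named in the Preflight message, assuming the affected page has a StructParents value of 0.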

    Applied like this

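    PDDocument document = PDDocument.load(SOURCE);
    // rebuildParentTree is not shown here; presumably it runs collect(...) from the
    // structure tree root and passes the resulting map to rebuildParentTreeFromData(...)
    rebuildParentTree(document);
    document.save(RESULT);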
    

    (RebuildParentTreeFromStructure test testTestdatei)

    PAC3 and Adobe Preflight (at least the one in my old Acrobat 9.5) go all green for the result.

    Beware: This is no generic parent tree rebuilder yet. It is made to work for the test file at hand with a specific kind of structure tree nodes and content only in page content streams. For a generic tool it has to learn to cope with other kinds, too, and to also process e.g. marked content in embedded XObjects.
