PDFBox delete comment maintain strikethrough


Problem description

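I have a PDF with a comment on a paragraph. The paragraph is struck through. My requirement is to remove the comment from a specific page.

The following code should remove the specific annotation from my PDF, but it does not.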

PDDocument document = PDDocument.load(...File...);
List<PDAnnotation> annotations = new ArrayList<>();
PDPageTree allPages = document.getDocumentCatalog().getPages();

for (int i = 0; i < allPages.getCount(); i++) {
    PDPage page = allPages.get(i);
    annotations = page.getAnnotations();

    List<PDAnnotation> annotationToRemove = new ArrayList<PDAnnotation>();

    if (annotations.size() < 1)
        continue;
    else {
        for (PDAnnotation annotation : annotations) {

            if (annotation.getContents() != null && annotation.getContents().equals("Sample Strikethrough")) {
                annotationToRemove.add(annotation);
            }
        }
        annotations.removeAll(annotationToRemove);
    }
}

What is the best way to remove a specific comment and maintain the strikethrough on the text that the comment was applied to?

Solution

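The annotation you found is actually a text markup annotation of subtype StrikeOut, i.e. the strikethrough itself is the main appearance of this annotation. Thus, you must not remove the annotation. Instead, you should remove the data from which the additional appearance of the annotation, the hover text, is generated.

This can be done like this: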

final COSName POPUP = COSName.getPDFName("Popup");

PDDocument document = PDDocument.load(resource);
List<PDAnnotation> annotations = new ArrayList<>();
PDPageTree allPages = document.getDocumentCatalog().getPages();

List<COSObjectable> objectsToRemove = new ArrayList<>();

for (int i = 0; i < allPages.getCount(); i++) {
    PDPage page = allPages.get(i);
    annotations = page.getAnnotations();

    for (PDAnnotation annotation : annotations) {
        if ("StrikeOut".equals(annotation.getSubtype()))
        {
            COSDictionary annotationDict = annotation.getCOSObject();
            COSBase popup = annotationDict.getItem(POPUP);
            annotationDict.removeItem(POPUP);            // popup annotation
            annotationDict.removeItem(COSName.CONTENTS); // plain text comment
            annotationDict.removeItem(COSName.RC);       // rich text comment
            annotationDict.removeItem(COSName.T);        // author

            if (popup != null)
                objectsToRemove.add(popup);
        }
    }

    annotations.removeAll(objectsToRemove);
}

(RemoveStrikeoutComment.java test testRemoveLikeStephanImproved)


While looking into this, a PDFBox bug became apparent: the OP's original code should have removed the StrikeOut annotation completely, but it did nothing. The cause is a bug in how the COSArrayList class is used in the context of page annotations.

The page annotation list returned by page.getAnnotations() is an instance of COSArrayList. This class carries both a list of COS objects as they appear in the page Annots array and a list of wrappers for those entries (after resolving indirect references where necessary).

The removeAll method (sensibly) checks its argument collection for such wrappers. From the former list it removes the actual COS objects rather than the wrappers; from the latter list it removes the argument collection as is (i.e. with wrappers).

This works well for direct objects in the Annots array, but entries in the former list which are indirect references aren't properly removed as the code tries to remove the resolved annotation dictionaries while that list actually contains indirect references.

In the case at hand, this results in removals not being written back. In more general situations the results can be even weirder because the two lists now have different sizes. Index-oriented methods can therefore manipulate non-corresponding objects of the two lists...

(BTW, in my code above I remove an indirect reference, not a wrapper, leaving the lists in disarray, too, as this time only an entry of the former, not the latter list is removed; probably this should also be handled more securely.)
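The mismatch between the two lists can be reproduced with a small stand-alone model. This is only a sketch of the mechanism described above, not PDFBox code: the class names TwoListModel and Ref and the use of plain strings in place of annotation dictionaries are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Toy model only: these classes are hypothetical and are not PDFBox code. */
class TwoListModel {
    /** Stands in for an indirect reference to an annotation dictionary. */
    static final class Ref {
        final String target;
        Ref(String target) { this.target = target; }
    }

    // "raw" mimics the entries of the Annots array,
    // "resolved" mimics the list of wrappers for those entries.
    final List<Object> raw = new ArrayList<>();
    final List<String> resolved = new ArrayList<>();

    /** Adds an entry that is stored directly in the raw list. */
    void addDirect(String annotation) {
        raw.add(annotation);
        resolved.add(annotation);
    }

    /** Adds an entry that is stored as an indirect reference in the raw list. */
    void addIndirect(String annotation) {
        raw.add(new Ref(annotation));
        resolved.add(annotation);
    }

    /** Tries to remove the resolved objects from both lists. */
    boolean removeAll(Collection<String> toRemove) {
        // A Ref never equals the object it points to, so indirect entries stay.
        raw.removeAll(toRemove);
        return resolved.removeAll(toRemove);
    }

    public static void main(String[] args) {
        TwoListModel annots = new TwoListModel();
        annots.addDirect("Highlight");
        annots.addIndirect("StrikeOut");

        annots.removeAll(Arrays.asList("Highlight", "StrikeOut"));

        System.out.println("raw size: " + annots.raw.size());           // prints "raw size: 1"
        System.out.println("resolved size: " + annots.resolved.size()); // prints "resolved size: 0"
    }
}
```

In this model the directly stored entry disappears from both lists, while the indirectly referenced one survives in the raw list. That matches the observation that the removal is not written back and that the two lists end up with different sizes.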

A similar problem occurs in the retainAll method.

Another glitch: COSArrayList.lastIndexOf uses indexOf of the contained list.
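The effect of that glitch only shows when the same object occurs more than once in the list. The following sketch uses a hypothetical wrapper class, DelegatingList, that repeats the described mistake on top of a plain java.util list; it is not the COSArrayList source.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper, written only to illustrate the glitch. */
class DelegatingList<E> extends AbstractList<E> {
    private final List<E> actual;

    DelegatingList(List<E> actual) {
        this.actual = actual;
    }

    @Override
    public E get(int index) {
        return actual.get(index);
    }

    @Override
    public int size() {
        return actual.size();
    }

    // The mistake: delegating to indexOf yields the first match, not the last.
    @Override
    public int lastIndexOf(Object o) {
        return actual.indexOf(o);
    }

    public static void main(String[] args) {
        List<String> plain = Arrays.asList("a", "b", "a");
        List<String> wrapped = new DelegatingList<>(plain);

        System.out.println(plain.lastIndexOf("a"));   // prints 2
        System.out.println(wrapped.lastIndexOf("a")); // prints 0
    }
}
```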

The PDFBox source this has been analysed with is the current 3.0.0-SNAPSHOT, but the error occurs with all versions 2.0.0 - 2.0.7, so their code very likely contains these errors, too.
