How to know if a field is on a particular page?

Question

The PDFBox content stream is created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and it's causing text to be written to incorrect locations/pages.

I.e., I'm processing fields per page, but I'm not sure which fields are on which pages.

Is there a way to tell which field is on which page? Or, is there a way to get just the fields on the current page?

Thanks!

Mark

Code snippet:

PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();

// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
  PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
  processFields(acroForm, fieldList, contentStream, page);
  contentStream.close();
}


Recommended answer


The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages

The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.

Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.

The following code (written against the PDFBox 1.8.x API) should make clear how to do that:

@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
    PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();

    List<PDPage> pages = docCatalog.getAllPages();
    Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = pages.get(i);
        for (PDAnnotation annotation : page.getAnnotations())
            pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
    }

    PDAcroForm acroForm = docCatalog.getAcroForm();

    for (PDField field : (List<PDField>)acroForm.getFields()) {
        COSDictionary fieldDict = field.getDictionary();

        List<Integer> annotationPages = new ArrayList<Integer>();
        List<COSObjectable> kids = field.getKids();
        if (kids != null) {
            for (COSObjectable kid : kids) {
                COSBase kidObject = kid.getCOSObject();
                if (kidObject instanceof COSDictionary)
                    annotationPages.add(pageNrByAnnotDict.get(kidObject));
            }
        }

        Integer mergedPage = pageNrByAnnotDict.get(fieldDict);

        if (mergedPage == null)
            if (annotationPages.isEmpty())
                System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
            else
                System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
        else
            if (annotationPages.isEmpty())
                System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
            else
                System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
    }
}

Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:


  1. The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.

  2. Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.



PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process has been created by software that fills that entry, all is well, but if the PDF creator has not filled that entry, all you get is a null page.

In 2.0.x some methods for accessing the elements in question have changed, but the situation as a whole has not: to safely retrieve the page of a widget, you still have to iterate through the pages and find a page that references the annotation.

Safe method:

int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
    COSDictionary widgetObject = widget.getCOSObject();
    PDPageTree pages = document.getPages();
    for (int i = 0; i < pages.getCount(); i++)
    {
        for (PDAnnotation annotation : pages.get(i).getAnnotations())
        {
            COSDictionary annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject))
                return i;
        }
    }
    return -1;
}

Fast method:

int determineFast(PDDocument document, PDAnnotationWidget widget)
{
    PDPage page = widget.getPage();
    return page != null ? document.getPages().indexOf(page) : -1;
}

Usage:

PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
    for (PDField field : acroForm.getFieldTree())
    {
        System.out.println(field.getFullyQualifiedName());
        for (PDAnnotationWidget widget : field.getWidgets())
        {
            System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
            System.out.printf(" - fast: %s", determineFast(document, widget));
            System.out.printf(" - safe: %s\n", determineSafe(document, widget));
        }
    }
}

(DetermineWidgetPage.java)

(In contrast to the 1.8.x code the safe method here simply searches for the page of a single field. If in your code you have to determine the page of many widgets, you should create a lookup Map like in the 1.8.x case.)
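
The lookup-map idea from the note above can be sketched independently of the PDFBox version. What follows is only a minimal sketch under stated assumptions: it uses plain Java collections as stand-ins, so each page is modeled as a list of annotation objects (in real code these would be the dictionaries obtained from the page annotations), and `WidgetPageIndex`, `pageOf` and `pagesOf` are hypothetical names, not part of PDFBox. Keys are compared by identity, on the assumption that two distinct annotation objects with equal content must not be confused.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a PDFBox class): annotation object -> zero-based page index. */
final class WidgetPageIndex {
    // Identity-based keys: only the very same annotation object matches.
    private final Map<Object, Integer> pageByAnnotation = new IdentityHashMap<>();

    /** pages.get(i) is a stand-in for the annotation objects listed by page i. */
    WidgetPageIndex(List<? extends List<?>> pages) {
        for (int i = 0; i < pages.size(); i++) {
            for (Object annotation : pages.get(i)) {
                // Keep the first page that lists the annotation.
                pageByAnnotation.putIfAbsent(annotation, i);
            }
        }
    }

    /** Returns the zero-based page index, or -1 if no page lists the widget. */
    int pageOf(Object widget) {
        Integer page = pageByAnnotation.get(widget);
        return page != null ? page : -1;
    }

    /** Resolves many widgets after a single pass over the pages. */
    List<Integer> pagesOf(List<?> widgets) {
        List<Integer> result = new ArrayList<>();
        for (Object widget : widgets) {
            result.add(pageOf(widget));
        }
        return result;
    }
}
```

Building the index costs one pass over all page annotations; each later lookup is a map access, whereas the safe method above scans the pages again for every widget. Like the safe method, the sketch reports -1 for a widget that no page references.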

A document for which the fast method fails: aFieldTwice.pdf

A document for which the fast method works: test_duplicate_field2.pdf
