PDFBox 2.0.18 - How to iterate through the pages of a PDF and retrieve specific fields


Question

I'm using PDFBox to read specific fields in a PDF document. So far, I can get all the information I want from a PDF containing only one page. The PDF has fields with specific names, and I can read all of them and insert them into a database.

I use this code with the AcroForm to access the fields:

InputStream document = item.getInputStream();
pdf = PDDocument.load(new RandomAccessBufferedFileInputStream(document));
pdCatalog = pdf.getDocumentCatalog();
pdAcroForm = pdCatalog.getAcroForm();

String dateRapport = pdAcroForm.getField("import_Date01").getValueAsString();
String radioRaison = pdAcroForm.getField("NoFlight").getValueAsString();
boolean hasdata = false;

if(radioRaison.length() > 0 && !radioRaison.equals("Off")) {
    if(radioRaison.equals("NR")) {
        rvhi.setRaison(obtenirRaison(raisons, "NR"));
    }else if(radioRaison.equals("WX")) {
        rvhi.setRaison(obtenirRaison(raisons, "ME"));
    }else if(radioRaison.equals("US")) {
        rvhi.setRaison(obtenirRaison(raisons, "BR"));
    }
}
if(pdAcroForm.getField("import_Hmn0"+indexEnString).getValueAsString().length() > 0) 
{
    hasdata = true;
}

pdf.close();

return hasdata;

Now my problem is doing the same thing with a PDF that contains multiple identical pages with the same field names but different data in the fields. I would like to iterate through each page, call the same method, and retrieve the field data on each page.

I use the code below to iterate through the pages of the PDF, but I don't know how to get the fields on the current page. How do I get the AcroForm fields from the PDPage object?

PDPageTree nbPages = pdf.getPages();

if(nbPages.getCount() > 1) {
    for(PDPage page : nbPages) {
        // ??? how to get the AcroForm fields from the PDPage page ???
    }
}

Thanks in advance for your replies!

Recommended answer

There is no such thing as a list of PDField objects for the current page; an AcroForm is document-wide. So the first part of your question already gets the full list of fields in the document. (See 12.7.1 in the PDF specification from Adobe.)

Fields can have the same fully qualified name, but then their values also have to be the same. (See 12.7.3.2 in the PDF specification.)

What probably happens in your document is that the partial name of the field is the same on every page, but the fully qualified name is not. The fully qualified name is formed by concatenating the partial name of the field with the partial names of its ancestor objects, separated by periods, as in "parent partial name"."child partial name".
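The naming rule above can be sketched in plain Java without any PDF library. `FieldNode` is a hypothetical stand-in for a form field, and the parent names `page1` and `page2` are invented for illustration; the real document may use different ancestor names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class QualifiedNameSketch {

    /** Hypothetical stand-in for a form field: a partial name plus an optional parent. */
    static final class FieldNode {
        final String partialName;
        final FieldNode parent;

        FieldNode(String partialName, FieldNode parent) {
            this.partialName = partialName;
            this.parent = parent;
        }
    }

    /** Walks up the parent chain and joins the partial names with periods. */
    static String fullyQualifiedName(FieldNode node) {
        Deque<String> parts = new ArrayDeque<>();
        for (FieldNode n = node; n != null; n = n.parent) {
            if (n.partialName != null) {
                parts.addFirst(n.partialName);
            }
        }
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        // Assumed structure: one parent node per page, each with a child of the same partial name.
        FieldNode page1 = new FieldNode("page1", null);
        FieldNode page2 = new FieldNode("page2", null);
        FieldNode date1 = new FieldNode("import_Date01", page1);
        FieldNode date2 = new FieldNode("import_Date01", page2);

        // Same partial name, different fully qualified names.
        System.out.println(fullyQualifiedName(date1));
        System.out.println(fullyQualifiedName(date2));
    }
}
```

If the document is built this way, a lookup by the bare partial name would not be enough to tell the two fields apart.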

So basically you'll either have to use the fully qualified name to find a field, or iterate over the list of fields to find all the fields in the document.
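As a rough sketch of the second option, assuming PDFBox 2.0.x on the classpath, the field tree of the AcroForm can be walked to print each terminal field's fully qualified name next to its partial name. The class and method names (`ListFieldNames`, `describeFields`) are made up for this example, and the PDF path is taken from the command line.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField;

public class ListFieldNames {

    /** Returns one line per terminal field: fully qualified name, partial name, value. */
    static List<String> describeFields(PDDocument pdf) {
        List<String> lines = new ArrayList<>();
        PDAcroForm acroForm = pdf.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return lines; // the document has no form
        }
        // getFieldTree() visits nested fields too, not only the top-level ones.
        for (PDField field : acroForm.getFieldTree()) {
            if (field instanceof PDTerminalField) {
                lines.add(field.getFullyQualifiedName()
                        + " (partial: " + field.getPartialName() + ") = "
                        + field.getValueAsString());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF to inspect.
        try (PDDocument pdf = PDDocument.load(new File(args[0]))) {
            describeFields(pdf).forEach(System.out::println);
        }
    }
}
```

Running this against the multi-page document should show whether the repeated fields differ in their fully qualified names, and those names can then be passed to `pdAcroForm.getField(...)`.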

You can find the page on which a particular field is displayed, because a field uses annotations (widget annotations) to show itself on a page. These annotations live in the Annots array at the page level. Whether PDFBox has a convenience function to do this easily, I don't know.
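One possible way to do this by hand, assuming PDFBox 2.0.x, is to index every annotation dictionary by the page that lists it and then look up each field's widgets in that index. This is only a sketch: `FieldsByPage` and `fieldsByPage` are names invented here, and a field with several widgets on one page will appear more than once in that page's list.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class FieldsByPage {

    /** Maps each 0-based page index to the fields that have a widget on that page. */
    static Map<Integer, List<PDField>> fieldsByPage(PDDocument pdf) throws IOException {
        Map<Integer, List<PDField>> result = new TreeMap<>();
        PDAcroForm acroForm = pdf.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return result; // no form, nothing to map
        }

        // Index every annotation dictionary by the page whose Annots array contains it.
        // Identity comparison is used because the same dictionary object is shared.
        Map<COSDictionary, Integer> pageOfAnnotation = new IdentityHashMap<>();
        int pageIndex = 0;
        for (PDPage page : pdf.getPages()) {
            for (PDAnnotation annotation : page.getAnnotations()) {
                pageOfAnnotation.put(annotation.getCOSObject(), pageIndex);
            }
            pageIndex++;
        }

        // Non-terminal fields return no widgets, so only terminal fields end up in the result.
        for (PDField field : acroForm.getFieldTree()) {
            for (PDAnnotationWidget widget : field.getWidgets()) {
                Integer index = pageOfAnnotation.get(widget.getCOSObject());
                if (index != null) {
                    result.computeIfAbsent(index, k -> new ArrayList<>()).add(field);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF to inspect.
        try (PDDocument pdf = PDDocument.load(new File(args[0]))) {
            for (Map.Entry<Integer, List<PDField>> entry : fieldsByPage(pdf).entrySet()) {
                for (PDField field : entry.getValue()) {
                    System.out.println("page " + (entry.getKey() + 1) + ": "
                            + field.getFullyQualifiedName() + " = " + field.getValueAsString());
                }
            }
        }
    }
}
```

With the fields grouped per page, the existing extraction logic could be run once per page by matching on each field's partial name instead of calling `getField` with a fixed name.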
