Extract page number from PDF file


Question

I have a PDF document that may have been created by extracting a few pages from another PDF document. How do I get the page number? The starting page number is 572, whereas for a complete PDF document it would have been 1.

Do you think converting the PDF to XML would solve this issue?

Recommended answer

I finally figured it out using iText. It would not have been possible without Bovrosky's hint; many thanks to him. Posting the code sample:

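import java.util.ListIterator;

import com.itextpdf.text.pdf.PRIndirectReference;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfIndirectReference;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

// The class name is illustrative; the answer posted only the two methods.
public class PageLabelReader {

    public void process(PdfReader reader) {
        // The posted snippet used an undefined `dict`; /PageLabels lives in the
        // document catalog, so the catalog is assumed here.
        PdfDictionary dict = reader.getCatalog();
        // Assumes /PageLabels is stored as an indirect reference, as in the asker's file.
        PRIndirectReference obj = (PRIndirectReference) dict.get(PdfName.PAGELABELS);
        System.out.println(obj.getNumber());
        PdfObject ref = reader.getPdfObject(obj.getNumber());
        PdfArray array = (PdfArray) ((PdfDictionary) ref).get(PdfName.NUMS);
        System.out.println("Start Page: " + resolvePdfIndirectReference(array, reader));
    }

    // Follows arrays and indirect references until it reaches a page label
    // dictionary, then returns that dictionary's /St (start) value.
    private static int resolvePdfIndirectReference(PdfObject obj, PdfReader reader) {
        if (obj instanceof PdfArray) {
            PdfDictionary subDict = null;
            PdfIndirectReference indRef = null;
            ListIterator<PdfObject> itr = ((PdfArray) obj).listIterator();
            while (itr.hasNext()) {
                PdfObject pdfObj = itr.next();
                if (pdfObj instanceof PdfIndirectReference) {
                    indRef = (PdfIndirectReference) pdfObj;
                }
                if (pdfObj instanceof PdfDictionary) {
                    // A direct dictionary wins; stop at the first one.
                    subDict = (PdfDictionary) pdfObj;
                    break;
                }
            }
            if (subDict != null) {
                return resolvePdfIndirectReference(subDict, reader);
            } else if (indRef != null) {
                return resolvePdfIndirectReference(indRef, reader);
            }
        } else if (obj instanceof PdfIndirectReference) {
            PdfObject ref = reader.getPdfObject(((PdfIndirectReference) obj).getNumber());
            return resolvePdfIndirectReference(ref, reader);
        } else if (obj instanceof PdfDictionary) {
            PdfNumber num = (PdfNumber) ((PdfDictionary) obj).get(PdfName.ST);
            // /St is optional and defaults to 1 when absent.
            return num != null ? num.intValue() : 1;
        }
        return 0;
    }
}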
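
To make the role of the start value clearer, here is a minimal sketch of how a start value such as 572 turns into the number shown for each page. It does not use iText. It models the /Nums entries as a map from the zero-based index of a range's first page to that range's /St value, and handles plain decimal numbering only. The class name PageLabelSketch, the method labelNumber, and the fallback for pages not covered by any range are assumptions made for illustration, not part of the original answer.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not an iText class: a simplified model of /PageLabels /Nums.
class PageLabelSketch {

    // ranges: key = zero-based index of the first page in a labelling range,
    //         value = the /St (start) number of that range.
    static int labelNumber(TreeMap<Integer, Integer> ranges, int pageIndex) {
        // Find the range that starts at or before this page.
        Map.Entry<Integer, Integer> range = ranges.floorEntry(pageIndex);
        if (range == null) {
            // Assumption: with no covering range, fall back to the physical page number.
            return pageIndex + 1;
        }
        // Pages count up from the range's start value.
        return range.getValue() + (pageIndex - range.getKey());
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> ranges = new TreeMap<>();
        ranges.put(0, 572); // the extracted document starts at page 572
        System.out.println(labelNumber(ranges, 0)); // prints 572
        System.out.println(labelNumber(ranges, 3)); // prints 575
    }
}
```

With a single range starting at page index 0, the displayed number is just the start value plus the page's offset, which is why reading /St from the first label dictionary is enough for a document like the one in the question.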
