Extract folders from a portfolio PDF in Java


Question

I have a portfolio PDF with folders, subfolders and files. I need to extract the same structure, with its folders, subfolders and files, using iText in Java. So far I am only getting the files via EMBEDDEDFILES. How can I fetch the folders as well?

Please find the code I am using below. It only gives me the files present inside the folders.

public static void extractAttachments(String src, String dir) throws IOException
{
    File folder = new File(dir);
    folder.mkdirs();

    PdfReader reader = new PdfReader(src);

    PdfDictionary root = reader.getCatalog();

    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println(""+names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println(""+embedded.toString());

    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);

    // The Names array alternates between a key string and its file specification dictionary.
    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folder, filespecs.getAsString(i++),
                filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, File dir, PdfString name, PdfDictionary filespec)
        throws IOException
{
    PRStream stream;
    FileOutputStream fos;
    String filename;
    PdfArray parent;
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);
    //System.out.println(""+refs.getKeys().toString());

    for (Object key : refs.getKeys())
    {
        stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject((PdfName) key));

        filename = filespec.getAsString((PdfName) key).toString();

        // System.out.println("" + filename);
        fos = new FileOutputStream(new File(dir, filename));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
    }
}


Recommended answer

The folder structure the OP tries to replicate while extracting portfolio files is specified in the Adobe® Supplement to the ISO 32000, BaseVersion: 1.7, ExtensionLevel: 3. Thus, it is not part of the current PDF standard and, therefore, PDF processing software is not required to understand this kind of information. It appears to be scheduled for addition to the upcoming PDF-2 (ISO 32000-2) standard, though.

To extract portfolio files into the associated folder structure, therefore, we have to retrieve the folder information as specified in the Adobe® Supplement:


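Beginning with extension level 3, a portable collection can contain a Folders object for the purpose of organizing files into a hierarchical structure. The structure is represented by a tree with a single root folder acting as the common ancestor for all other folders and files in the collection. The single root folder is referenced in the Folders entry of Table 8.6 on page 28.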

Table 8.6c describes the entries in a folder dictionary

  • ID integer (Required; ExtensionLevel 3) A non-negative integer value representing the unique folder identification number. Two folders shall not share the same ID value. The folder ID value appears as part of the name tree key of any file associated with this folder. A detailed description of the association between folder and files can be found after this table.

  • Name text string (Required; ExtensionLevel 3) A file name representing the name of the folder. Two sibling folders shall not share the same name following case normalization.

  • Child dictionary (Required if the folder has any descendents; ExtensionLevel 3) An indirect reference to the first child folder of this folder.

  • Next dictionary (Required for all but the last item at each level; ExtensionLevel 3) An indirect reference to the next sibling folder at this level.

(section 8.2.4 Collections)

For example, like this:

static Map<Integer, File> retrieveFolders(PdfReader reader, File baseDir) throws DocumentException
{
    Map<Integer, File> result = new HashMap<Integer, File>();

    PdfDictionary root = reader.getCatalog();
    PdfDictionary collection = root.getAsDict(PdfName.COLLECTION);
    if (collection == null)
        throw new DocumentException("Document has no Collection dictionary");
    PdfDictionary folders = collection.getAsDict(FOLDERS);
    if (folders == null)
        throw new DocumentException("Document collection has no folders dictionary");

    collectFolders(result, folders, baseDir);

    return result;
}

static void collectFolders(Map<Integer, File> collection, PdfDictionary folder, File baseDir)
{
    // Name is a PDF text string, so decode it as Unicode rather than using the raw value.
    PdfString name = folder.getAsString(PdfName.NAME);
    File folderDir = new File(baseDir, name.toUnicodeString());
    folderDir.mkdirs();
    PdfNumber id = folder.getAsNumber(PdfName.ID);
    collection.put(id.intValue(), folderDir);

    PdfDictionary next = folder.getAsDict(PdfName.NEXT);
    if (next != null)
        collectFolders(collection, next, baseDir);
    PdfDictionary child = folder.getAsDict(CHILD);
    if (child != null)
        collectFolders(collection, child, folderDir);
}

final static PdfName FOLDERS = new PdfName("Folders");
final static PdfName CHILD = new PdfName("Child");

(Excerpt from PortfolioFileExtraction.java)

and use this retrieved folder information when writing the files.
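The Child/Next linkage quoted above can also be walked without recursion. The following is only a minimal standalone sketch of that traversal: `FolderNode` and `FolderPaths` are hypothetical in-memory stand-ins invented for illustration (they are not iText classes), and the guard against revisiting a node is an added assumption for malformed files whose links form a cycle.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a folder dictionary with its ID, Name, Child and Next entries. */
final class FolderNode {
    final int id;
    final String name;
    FolderNode child; // first child folder, or null
    FolderNode next;  // next sibling folder, or null

    FolderNode(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class FolderPaths {
    /** Maps each folder ID to a slash-separated path, depth first, children before siblings. */
    static Map<Integer, String> collect(FolderNode root) {
        Map<Integer, String> paths = new LinkedHashMap<>();
        if (root == null) {
            return paths;
        }
        Set<FolderNode> seen = new HashSet<>();
        // Two stacks kept in lockstep: the node to visit and the path of its parent.
        Deque<FolderNode> nodes = new ArrayDeque<>();
        Deque<String> parents = new ArrayDeque<>();
        nodes.push(root);
        parents.push("");
        while (!nodes.isEmpty()) {
            FolderNode node = nodes.pop();
            String parent = parents.pop();
            if (!seen.add(node)) {
                continue; // already visited: the Child/Next links contain a cycle
            }
            String path = parent.isEmpty() ? node.name : parent + "/" + node.name;
            paths.put(node.id, path);
            if (node.next != null) { // a sibling shares this node's parent path
                nodes.push(node.next);
                parents.push(parent);
            }
            if (node.child != null) { // a child lives below this node's own path
                nodes.push(node.child);
                parents.push(path);
            }
        }
        return paths;
    }
}
```

With real documents the same loop would read the `Child` and `Next` entries from the folder dictionaries instead of fields, and the resulting paths would be resolved against the output directory.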

The association of files and folders is specified in the Adobe® Supplement like this:

As previously mentioned, files in the EmbeddedFiles name tree are associated with folders by a special naming convention applied to the name tree key strings. Strings that conform to the following rules serve to associate the corresponding file with a folder:


  • The name tree keys are PDF text strings.
  • The first character, excluding any byte order marker, is U+003C, the LESS-THAN SIGN (<).
  • The following characters shall be one or more digits (0 to 9) followed by the closing U+003E, the GREATER-THAN SIGN (>).
  • The remainder of the string is a file name.

The section of the string enclosed by LESS-THAN SIGN GREATER-THAN SIGN (<>) is interpreted as a numeric value that specifies the ID value of the folder with which the file is associated. The value shall correspond to a folder ID. The section of the string following the folder ID tag represents the file name of the embedded file.

Files in the EmbeddedFiles name tree that do not conform to these rules shall be treated as associated with the root folder.

(section 8.2.4 Collections)
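The naming rules quoted above can be sketched in plain Java, independent of any PDF library. The `FolderKey` class below is a hypothetical helper invented for illustration. It assumes the key has already been decoded to a Java string, and it treats any key that does not match the rules (including one whose digits overflow an `int`) as belonging to the root folder.

```java
import java.util.OptionalInt;

/** Hypothetical helper: splits a name tree key of the form "<digits>fileName". */
final class FolderKey {
    final OptionalInt folderId; // empty means "associated with the root folder"
    final String fileName;

    private FolderKey(OptionalInt folderId, String fileName) {
        this.folderId = folderId;
        this.fileName = fileName;
    }

    static FolderKey parse(String key) {
        // Skip a leading byte order mark, which the rules exclude from the first character.
        String s = key.startsWith("\uFEFF") ? key.substring(1) : key;
        if (s.startsWith("<")) {
            int closing = s.indexOf('>');
            if (closing > 1 && allDigits(s, 1, closing)) {
                try {
                    int id = Integer.parseInt(s.substring(1, closing));
                    return new FolderKey(OptionalInt.of(id), s.substring(closing + 1));
                } catch (NumberFormatException e) {
                    // More digits than fit into an int: fall through to the root folder case.
                }
            }
        }
        // Non-conforming keys keep their full text as the file name.
        return new FolderKey(OptionalInt.empty(), s);
    }

    /** True if every character in [from, to) is an ASCII digit 0 to 9. */
    private static boolean allDigits(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}
```

A caller would look up `folderId` in the map of extracted folder directories and fall back to the base directory when the ID is absent or unknown.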

Your methods can be extended to honor this folder association like this:

public static void extractAttachmentsWithFolders(PdfReader reader, String dir) throws IOException, DocumentException
{
    File folder = new File(dir);
    folder.mkdirs();

    Map<Integer, File> folders = retrieveFolders(reader, folder);

    PdfDictionary root = reader.getCatalog();

    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println("" + names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println("" + embedded.toString());

    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);

    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folders, folder, filespecs.getAsString(i++), filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, Map<Integer, File> dirs, File dir, PdfString name, PdfDictionary filespec) throws IOException
{
    PRStream stream;
    FileOutputStream fos;
    String filename;
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);

    File dirHere = dir;
    String nameString = name.toUnicodeString();
    if (nameString.startsWith("<"))
    {
        int closing = nameString.indexOf('>');
        if (closing > 0)
        {
            int folderId = Integer.parseInt(nameString.substring(1, closing));
            File folderFile = dirs.get(folderId);
            if (folderFile != null)
                dirHere = folderFile;
        }
    }

    for (PdfName key : refs.getKeys())
    {
        stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));

        filename = filespec.getAsString(key).toString();

        fos = new FileOutputStream(new File(dirHere, filename));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
    }
}

(Excerpt from PortfolioFileExtraction.java)

Applying these methods to your sample PDF (e.g. using the test method testSamplePortfolio11Folders in PortfolioFileExtraction.java, https://github.com/mkl-public/testarea-itext5/blob/master/src/test/java/mkl/testarea/itext5/extract/PortfolioFileExtraction.java), one gets

Root
│   ThumbImpression.pdf
│
├───Folder 1
│   │   EStampPdf.pdf
│   │   Presentation.pdf
│   │
│   ├───Folder 11
│   │   │   Test.pdf
│   │   │
│   │   └───Folder 111
│   └───Folder 12
└───Folder 2
        SealDeed.pdf
