Get marked content using the MCID


Question

I am using iText to recreate the Tag Tree feature of Acrobat.

So far I have managed to get the tag structure.

The final thing I am trying to figure out is how to get and decode the "Marked Content" for a tag from the content stream.

Edit: Additional intent

The intent of this question is to figure out how to access the content streams using an MCID and decode the content.

Edit 2: Add iText RUPS reference

The image below shows where I have reached in the tree. The red line points to an MCID whose content I am trying to get.

Edit 3: Add current code that builds a tree

private void manipulate(PdfDictionary element, ItemCollection items)
    {
        if (element == null)
        {
            return;
        }

        ICollection<PdfName> val = element.KeySet();
        PdfObject tagName = element.Get(PdfName.S);
        PdfObject elementType = element.Get(PdfName.Type);

        string tn = "";

        if (tagName != null)
        {
            tn = ((PdfName)tagName).GetValue();
        }
        else
        {
            tn = ((PdfName)elementType).GetValue();
        }

        TreeViewItem tvI = new TreeViewItem() { Header = tn, IsExpanded = true };
        items.Add(tvI);

        PdfArray kids = element.GetAsArray(PdfName.K);
        if (kids == null)
        {
            return;
        }
        for (int i = 0; i < kids.Size(); i++)
        {
            PdfDictionary child = kids.GetAsDictionary(i); //Code change required here to detect MCID & get content, this line returns null when child is a MCID
            manipulate(child, tvI.Items);
        }
    }
}

Edit 4: The reason for this is to recreate the "Tag Tree" feature of Acrobat.

Recommended answer

Based on the tags you added to the question, I see that you are using iText 7. iText 7 has a class named TaggedPdfReaderTool. This class can be used to convert Tagged PDF files to XML:

FileOutputStream outXml = new FileOutputStream("pdf_content.xml");
TaggedPdfReaderTool tool = new TaggedPdfReaderTool(document);
tool.setRootTag("root");
tool.convertToXml(outXml);
outXml.close();

The XML will have the same structure as the "tag structure" you were already able to extract. The content inside the XML tags will correspond to the content that is marked as "part of a tag" in the PDF content stream.

Important message to other readers: the screenshot in the question clearly shows that the PDF is tagged. If you try this code snippet on a PDF that isn't tagged, you won't be able to convert the content to XML.

Update: a lower-level approach

You can also examine all the parts of the structure tree like this: process(document.getStructTreeRoot());

The process() method looks like this:

public static void process(IPdfStructElem elem) {
    if (elem == null) return;
    System.out.println(elem.getRole());
    System.out.println(elem.getClass().getName());
    if (elem instanceof PdfStructElem) {
        processStructElem((PdfStructElem) elem);
    }
    if (elem.getKids() == null) return;
    for (IPdfStructElem structElem : elem.getKids()) {
        process(structElem);
    }
}

public static void processStructElem(PdfStructElem elem) {
    PdfDictionary page = elem.getPdfObject().getAsDictionary(PdfName.Pg);
    if (page == null) return;
    PdfStream contents = page.getAsStream(PdfName.Contents);
    if (contents != null) {
        System.out.println(new String(contents.getBytes()));
    }
    PdfArray array = page.getAsArray(PdfName.Contents);
    System.out.println(array);
}

Note that the /Contents of a page can refer to a single stream or to an array of streams. In this short snippet, I ignored all /Contents stored in an array of streams: the array is only printed, not decoded.
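If you do need to handle the array case, the parts of a /Contents array are meant to be read as one continuous stream, in order. Below is a minimal sketch of that joining step in plain Java. The class name ContentStreams and the method join are hypothetical, and the sketch assumes you have already obtained the decoded bytes of each stream (for example by calling getBytes() on each PdfStream, as the snippet above does for the single-stream case). A newline is written after each part so that the last token of one part cannot run into the first token of the next.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ContentStreams {

    // Joins the decoded parts of a page's /Contents array into one buffer.
    // A newline follows each part so tokens at the part boundaries stay separate.
    static byte[] join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : decodedStreams) {
            out.write(part, 0, part.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two hypothetical parts of one page's content, already decoded.
        List<byte[]> parts = Arrays.asList(
                "q\nBT".getBytes(StandardCharsets.ISO_8859_1),
                "ET\nQ".getBytes(StandardCharsets.ISO_8859_1));
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost when printing.
        System.out.print(new String(join(parts), StandardCharsets.ISO_8859_1));
    }
}
```

The joined buffer can then be treated exactly like the single-stream case.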

This is an example of the content that was revealed when executing this on a tagged PDF we use for tests:

EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 432.34 184.23 27.98 re
f
Q
EMC
/Span <</MCID 13>> BDC
q
BT
/F2 12 Tf
42 442.65 Td
1 1 1 rg
(The Library)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 399.11 184.23 27.98 re
f
Q
EMC
/Span <</MCID 14>> BDC
q
BT
/F2 12 Tf
42 409.42 Td
1 1 1 rg
(The Company)Tj
ET
Q
EMC
/Span <</MCID 15>> BDC
q
BT
/F1 20 Tf
227.73 472.71 Td
(The Library)Tj
ET
Q
EMC
/Span <</MCID 16>> BDC
q
BT
/F2 12 Tf
229.23 440.45 Td
(iText is a software developer toolkit that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 17>> BDC
q
BT
/F2 12 Tf
229.23 424.46 Td
(functionalities within their applications, processes or products.)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
605.03 262.75 191.73 235.31 re
f
Q
EMC
/Span <</MCID 18>> BDC
q
BT
/F1 16 Tf
676.45 482.5 Td
0.97647 0.76078 0.15294 rg
(What?)Tj
ET
Q
EMC
/Span <</MCID 19>> BDC
q
BT
/F2 12 Tf
607.94 453.08 Td
1 1 1 rg
(iText is a software developer toolkit)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 20>> BDC
q
BT
/F2 12 Tf
611.61 437.09 Td
1 1 1 rg
(that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 21>> BDC
q
BT
/F2 12 Tf
634.95 421.11 Td
1 1 1 rg
(functionalities within their)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 22>> BDC
q
BT
/F2 12 Tf
669.96 405.12 Td
1 1 1 rg
(applications)Tj
ET
Q
EMC
/Span <</MCID 23>> BDC
q
BT
/F1 16 Tf
679.12 381.5 Td
0.97647 0.76078 0.15294 rg
(How?)Tj
ET
Q
EMC
/Span <</MCID 24>> BDC
q
BT
/F2 12 Tf
613.94 352.08 Td
1 1 1 rg
(By providing you with the tools to)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 25>> BDC
q
BT
/F2 12 Tf
607.59 336.09 Td
1 1 1 rg
(create and manipulate a pdf in your)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 26>> BDC
q
BT
/F2 12 Tf
668.96 320.11 Td
1 1 1 rg
(source code)Tj
ET
Q
EMC
/Span <</MCID 27>> BDC
q
BT
/F1 16 Tf
672.44 296.49 Td
0.97647 0.76078 0.15294 rg
(Really?)Tj
ET
Q
EMC
/Span <</MCID 28>> BDC
q
BT
/F2 12 Tf
673.64 267.06 Td
1 1 1 rg
(Yes really!)Tj
ET
Q
EMC

Everything that is not between BMC/EMC or BDC/EMC operators is not tagged. You are looking for the content that is marked with an MCID, such as the /Span <</MCID 13>> BDC ... EMC sequences above.

In a comment, I explained that it's better to use a different approach: parse the content streams of every page only once, and map all the objects you encounter to the elements in the structure tree.

With your approach, you have to parse the content stream of a page over and over again, once for every structure element. That requires much more processing.
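To illustrate the "parse once, then look up by MCID" idea, here is a minimal sketch in plain Java that makes a single pass over a decoded content stream and builds a map from MCID to the text shown inside that marked-content sequence. It is not iText code: the class McidTextIndex and its methods are hypothetical names, and the tokenizer is deliberately naive. It assumes the property list is written inline as <</MCID n>>, that text is shown only with Tj, and that literal strings contain no unescaped nested parentheses. It also treats the string bytes as readable characters, which only holds for simple fonts like the ones in the dump above; a real implementation needs the font's encoding and a proper content stream parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class McidTextIndex {

    private static final int NO_MCID = -1;

    // One alternative per token this sketch understands.
    private static final Pattern TOKEN = Pattern.compile(
            "<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC"          // group 1: BDC with an inline MCID
            + "|\\b(BMC|BDC)\\b"                          // group 2: any other sequence start
            + "|\\b(EMC)\\b"                              // group 3: sequence end
            + "|\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");     // group 4: literal string shown with Tj

    // Scans the content once and returns MCID -> concatenated Tj text.
    static Map<Integer, String> index(String content) {
        Map<Integer, StringBuilder> buffers = new LinkedHashMap<>();
        Deque<Integer> open = new ArrayDeque<>(); // currently open sequences, innermost first
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                open.push(Integer.parseInt(m.group(1)));
            } else if (m.group(2) != null) {
                open.push(NO_MCID);               // e.g. /Artifact BMC
            } else if (m.group(3) != null) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else {
                int mcid = innermostMcid(open);
                if (mcid != NO_MCID) {
                    buffers.computeIfAbsent(mcid, k -> new StringBuilder())
                           .append(unescape(m.group(4)));
                }
            }
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        buffers.forEach((mcid, text) -> result.put(mcid, text.toString()));
        return result;
    }

    // Returns the MCID of the nearest enclosing sequence that has one, or NO_MCID.
    private static int innermostMcid(Deque<Integer> open) {
        for (int mcid : open) {
            if (mcid != NO_MCID) {
                return mcid;
            }
        }
        return NO_MCID;
    }

    // Handles only the escapes \( \) and \\ ; octal and control escapes are left as-is.
    private static String unescape(String raw) {
        return raw.replaceAll("\\\\([()\\\\])", "$1");
    }

    public static void main(String[] args) {
        String content = String.join("\n",
                "/Artifact BMC", "q", "36 432.34 184.23 27.98 re", "f", "Q", "EMC",
                "/Span <</MCID 13>> BDC", "q", "BT", "/F2 12 Tf",
                "(The Library)Tj", "ET", "Q", "EMC");
        System.out.println(index(content));
    }
}
```

With such a map built once per page, each structure element only needs a lookup with its page and MCID instead of another pass over the stream.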
