How to read text of appearance stream?

Views: 154 · Published: 2016/9/21 14:50:11 · Tags: c# pdf itextsharp itext

Question

I have a PDF where the text shown in an annotation (as rendered in Adobe Reader) differs from what its /Contents and /RC entries contain. This is related to the problem I was dealing with in this question:

Can't change /Contents of annotation

In this case, instead of changing the appearance to match the annotation's contents, I want to do the opposite: get the appearance text and change the /Contents and /RC values to match. E.g., if the annotation displays "appearance" and /Contents is set to "content", I want to do something like:

void setContent(PdfDictionary dict)
{
 PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));
 dict.Put(PdfName.CONTENTS,str);
}

But I can't find where the appearance text is stored. I get the dictionary referenced by /AP with this code:

private PdfDictionary getAPAnnot(PdfArray annotArray, PdfDictionary annot)
{
    PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
    if (apDict != null)
    {
        PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
        PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);
        return apRefDict;
    }
    else
    {
        return null;
    }
}

This dictionary has the following hashMap:

{[/BBox, [-38.7578, -144.058, 62.0222, 1]]} 
{[/Filter, /FlateDecode]}   
{[/Length, 172]}    
{[/Matrix, [1, 0, 0, 1, 0, 0]]} 
{[/Resources, Dictionary]}

/Resources has indirect references to the fonts, but no contents. So it seems that the appearance stream doesn't include content data.

Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?

Recommended answer

Unfortunately the OP has not provided a sample PDF. Considering his previous question, though, he is most likely interested in free text annotations. Thus, I use a sample PDF of my own here. It has one page with a single typewriter free text annotation that reads "This is written using the typewriter tool."

The OP asked:

Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?

The major shortcoming of the OP's code is that it treats the normal appearance as a plain PdfDictionary:

PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);

It actually is a PdfStream, i.e. a dictionary with a data stream, and this data stream is where the appearance drawing instructions are located.
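To look at those drawing instructions directly, something along the following lines should work. This is only a sketch, assuming iTextSharp 5.x; the class and method names (AppearanceStreamReader, GetNormalAppearanceInstructions) are made up for illustration, and the ASCII decoding is meant for eyeballing the operators, not for recovering the text.

```csharp
using System.Text;
using iTextSharp.text.pdf;

public static class AppearanceStreamReader
{
    // Hypothetical helper: returns the decoded drawing instructions of the
    // normal appearance (/N), or null if there is no /AP entry or /N is not
    // a stream read from a file.
    public static string GetNormalAppearanceInstructions(PdfDictionary annot)
    {
        PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
        if (apDict == null)
        {
            return null;
        }

        // GetAsStream resolves an indirect reference and yields null
        // when the target is not a stream.
        PRStream normal = apDict.GetAsStream(PdfName.N) as PRStream;
        if (normal == null)
        {
            return null;
        }

        // GetStreamBytes applies the stream's filters (e.g. /FlateDecode).
        byte[] bytes = PdfReader.GetStreamBytes(normal);

        // For inspection only: string operands may use a font-specific encoding.
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Printing the returned string for an annotation dictionary is a quick way to see which operators the appearance actually uses.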

But even with this data stream at hand, it is not as simple as imagined by the OP:

PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));

Actually the text in the appearance stream can be drawn in pieces, e.g. in my sample file the stream data look like this:

0 w
131.2646 564.8243 180.008 30.984 re
n
q
1 0 0 1 0 0 cm
131.2646 564.8243 180.008 30.984 re
W
n
0 g
1 w
BT
/Cour 12 Tf
0 g
131.265 587.96 Td
(This ) Tj
35.999 0 Td
(is ) Tj
21.6 0 Td
(written ) Tj
57.599 0 Td
(using ) Tj
43.2 0 Td
(the ) Tj
-158.398 -16.3 Td
(typewriter ) Tj
79.199 0 Td
(tool.) Tj
ET
Q

Furthermore, the encoding does not need to be some standard encoding like here but can instead be defined for an embedded font on-the-fly.
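To make the problem concrete, here is a deliberately naive sketch that merely concatenates the literal string operands of Tj operators. It is not part of the original answer; NaiveTjReader and Concatenate are illustrative names. It ignores escapes, nested parentheses, hex strings, TJ arrays, positioning and font encodings, so it only happens to work for simple streams like the sample above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class NaiveTjReader
{
    // Matches "(...) Tj" where the string contains no parentheses or backslashes.
    private static readonly Regex TjOperand = new Regex(@"\(([^()\\]*)\)\s*Tj");

    // Joins the matched string operands in stream order, ignoring all Td moves.
    public static string Concatenate(string contentStream)
    {
        var sb = new StringBuilder();
        foreach (Match m in TjOperand.Matches(contentStream))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string sample =
            "BT\n/Cour 12 Tf\n131.265 587.96 Td\n(This ) Tj\n35.999 0 Td\n(is ) Tj\n" +
            "-158.398 -16.3 Td\n(typewriter ) Tj\n79.199 0 Td\n(tool.) Tj\nET";
        // Prints "This is typewriter tool." on one line: the line break implied
        // by the negative vertical Td offset is lost.
        Console.WriteLine(Concatenate(sample));
    }
}
```

Even on this friendly input the layout information is gone, and with a custom font encoding the extracted bytes would not be readable text at all.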

Thus, one has to apply full-fledged text extraction.

All of this can be implemented like this:

// Requires: using System; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
// pdfReader is an already opened PdfReader.
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    Console.Write("\nPage {0}\n", page);
    PdfDictionary pageDictionary = pdfReader.GetPageNRelease(page);
    PdfArray annotsArray = pageDictionary.GetAsArray(PdfName.ANNOTS);
    if (annotsArray == null || annotsArray.IsEmpty())
    {
        Console.Write("  No annotations.\n");
        continue;
    }
    foreach (PdfObject pdfObject in annotsArray)
    {
        // Entries of /Annots are usually indirect references; resolve them first.
        PdfObject direct = PdfReader.GetPdfObject(pdfObject);
        if (direct == null || !direct.IsDictionary())
        {
            continue;
        }
        PdfDictionary annotDictionary = (PdfDictionary)direct;
        Console.Write("  SubType: {0}\n", annotDictionary.GetAsName(PdfName.SUBTYPE));
        PdfDictionary appearancesDictionary = annotDictionary.GetAsDict(PdfName.AP);
        if (appearancesDictionary == null)
        {
            Console.Write("    No appearances.\n");
            continue;
        }
        // /AP may hold a normal (/N), rollover (/R) and down (/D) appearance.
        foreach (PdfName key in appearancesDictionary.Keys)
        {
            Console.Write("    Appearance: {0}\n", key);
            // Null if the entry is not a stream (e.g. a dictionary of states).
            PdfStream value = appearancesDictionary.GetAsStream(key);
            if (value != null)
            {
                String text = ExtractAnnotationText(value);
                Console.Write("    Text:\n---\n{0}\n---\n", text);
            }
        }
    }
}

It uses this helper method:

public String ExtractAnnotationText(PdfStream xObject)
{
    // The appearance stream is a form XObject carrying its own fonts in /Resources.
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

    // Run the regular text extraction machinery over the decoded stream content.
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
    processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);
    return strategy.GetResultantText();
}

In case of the sample file above, the output of the code is

Page 1
  SubType: /FreeText
    Appearance: /N
    Text:
---
This is written using the 
typewriter tool.
---
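With the appearance text in hand, the OP's original goal of syncing /Contents could be approached roughly as follows. This is a sketch under assumptions, not part of the original answer: ContentsUpdater and SetContents are illustrative names, and dropping /RC instead of regenerating matching rich text is a simplification of what the OP asked for.

```csharp
using iTextSharp.text.pdf;

public static class ContentsUpdater
{
    // Hypothetical helper: overwrites /Contents with the extracted appearance
    // text and removes /RC so the rich-text entry can no longer disagree.
    public static void SetContents(PdfDictionary annot, string appearanceText)
    {
        annot.Put(PdfName.CONTENTS, new PdfString(appearanceText, PdfObject.TEXT_UNICODE));
        annot.Remove(new PdfName("RC"));
    }
}
```

The change only reaches a file if the document is written out again afterwards, for example through a PdfStamper created on the same PdfReader.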

Beware, there are some annotations, in particular widget annotations of checkboxes and radio buttons, which have a slightly deeper structure than expected by the code here.
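For such annotations an entry of /AP (e.g. /N) is not a stream but a sub-dictionary mapping appearance state names such as /Yes or /Off to streams. A sketch of how the loop could cope with both shapes, assuming iTextSharp 5.x (AppearanceWalker and CollectStreams are illustrative names, not library API):

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class AppearanceWalker
{
    // Collects every appearance stream below an /AP dictionary, whether an
    // entry is a stream itself or a sub-dictionary of state name -> stream.
    public static List<KeyValuePair<string, PdfStream>> CollectStreams(PdfDictionary appearances)
    {
        var result = new List<KeyValuePair<string, PdfStream>>();
        foreach (PdfName key in appearances.Keys)
        {
            PdfStream stream = appearances.GetAsStream(key);
            if (stream != null)
            {
                result.Add(new KeyValuePair<string, PdfStream>(key.ToString(), stream));
                continue;
            }

            // Not a stream: check for a dictionary of appearance states.
            PdfDictionary states = appearances.GetAsDict(key);
            if (states == null)
            {
                continue;
            }
            foreach (PdfName state in states.Keys)
            {
                PdfStream stateStream = states.GetAsStream(state);
                if (stateStream != null)
                {
                    result.Add(new KeyValuePair<string, PdfStream>(key + " " + state, stateStream));
                }
            }
        }
        return result;
    }
}
```

Each collected stream can then be passed to ExtractAnnotationText in the same way as before.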
