How to extract highlighted text from PDF using iTextSharp?

Question

As per the following post, iTextSharp PDF Reading highlighed text (highlight annotations) using C#, this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

works for extracting PDF annotations. But why does the following, otherwise identical code not work for highlights (specifically, PdfName.HIGHLIGHT does not work)?

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}


Recommended answer

Please take a look at table 30 in ISO-32000-1 (aka the PDF reference). It is entitled "Entries in a page object". Among these entries, you can find a key named Annots. Its value is:


(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").

You will not find an entry with a key such as Highlight, so it is only to be expected that the returned array is null when you have this line:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you already did:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

Now you need to loop over this array and look for annotations with Subtype equal to Highlight. This type of annotation is listed in table 169 of ISO-32000-1, entitled "Annotation types".
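
The loop described above could be sketched as follows. This is a minimal sketch, assuming iTextSharp 5.x; the class name HighlightFinder and the methods IsHighlight and GetHighlights are illustrative names that do not appear in the original post.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper class (not part of iTextSharp).
public static class HighlightFinder
{
    // True when the annotation dictionary has /Subtype /Highlight.
    public static bool IsHighlight(PdfDictionary annotation)
    {
        return annotation != null
            && PdfName.HIGHLIGHT.Equals(annotation.GetAsName(PdfName.SUBTYPE));
    }

    // Collects the highlight annotations of one page (page numbers are 1-based).
    public static List<PdfDictionary> GetHighlights(PdfReader reader, int pageNumber)
    {
        var result = new List<PdfDictionary>();
        PdfDictionary page = reader.GetPageN(pageNumber);
        PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
        if (annots == null)
        {
            return result; // the Annots entry is optional
        }

        for (int i = 0; i < annots.Size; i++)
        {
            // GetAsDict resolves the indirect reference stored in the array.
            PdfDictionary annotation = annots.GetAsDict(i);
            if (IsHighlight(annotation))
            {
                result.Add(annotation);
            }
        }
        return result;
    }
}
```

Calling HighlightFinder.GetHighlights(reader, i) inside the page loop from the question would then yield only the highlight annotations of page i.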

In other words, your assumption that a page dictionary contains entries with key Highlight was wrong. If you read the whole specification, you will also discover another false assumption you've been making: that the highlighted text is stored in the Contents entry of the annotations. This reveals a lack of understanding about the nature of annotations versus page content.

The text you are looking for is stored in the content stream of the page. The content stream of the page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (stored in the QuadPoints array) and you need to use these coordinates to parse the text that is present in the page content at those coordinates.
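
One way to act on this could look like the sketch below. It assumes iTextSharp 5.5.x, where RegionTextRenderFilter accepts an iTextSharp.text.Rectangle; the class HighlightText and its methods QuadBounds and Extract are hypothetical names introduced for illustration. Each group of eight QuadPoints numbers is reduced to its bounding box, and only text inside that box is extracted. The filter works on whole text chunks, so the result may contain more than the highlighted words, and page rotation is not handled.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class (not part of iTextSharp).
public static class HighlightText
{
    // Bounding box { llx, lly, urx, ury } of one quadrilateral,
    // given as 8 numbers: x1 y1 x2 y2 x3 y3 x4 y4.
    public static float[] QuadBounds(float[] q)
    {
        float minX = q[0], maxX = q[0], minY = q[1], maxY = q[1];
        for (int k = 2; k < 8; k += 2)
        {
            minX = Math.Min(minX, q[k]);
            maxX = Math.Max(maxX, q[k]);
            minY = Math.Min(minY, q[k + 1]);
            maxY = Math.Max(maxY, q[k + 1]);
        }
        return new[] { minX, minY, maxX, maxY };
    }

    // Extracts the page text lying under each quadrilateral of a highlight annotation.
    public static string Extract(PdfReader reader, int pageNumber, PdfDictionary highlight)
    {
        PdfArray quads = highlight.GetAsArray(PdfName.QUADPOINTS);
        if (quads == null)
        {
            return string.Empty;
        }

        var sb = new StringBuilder();
        for (int i = 0; i + 7 < quads.Size; i += 8)
        {
            var q = new float[8];
            for (int k = 0; k < 8; k++)
            {
                q[k] = quads.GetAsNumber(i + k).FloatValue;
            }
            float[] b = QuadBounds(q);
            var rect = new iTextSharp.text.Rectangle(b[0], b[1], b[2], b[3]);

            // Keep only text chunks that touch the highlighted region.
            RenderFilter[] filters = { new RegionTextRenderFilter(rect) };
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), filters);
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy));
        }
        return sb.ToString();
    }
}
```

A highlight that spans several lines usually carries one quadrilateral per line, which is why the sketch extracts text per quadrilateral rather than from the annotation's overall Rect.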
