PDF Reading highlighted text (highlight annotations) using C#


Question

I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For the highlight annotation, I only get a rectangle for the area on the page which is highlighted.

My goal is to extract the text that has been highlighted. For that I use PdfTextExtractor:

Rectangle rect = new Rectangle(
    pdfArray.GetAsNumber(0).FloatValue, 
    pdfArray.GetAsNumber(1).FloatValue,
    pdfArray.GetAsNumber(2).FloatValue,
    pdfArray.GetAsNumber(3).FloatValue);

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;

The result returned by PdfTextExtractor is not entirely correct. For instance, it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.

Interestingly enough, the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).

I would love to hear any input regarding this issue, including solutions that don't involve iTextSharp.

Solution

The cause

Interestingly enough, the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).

This is in fact the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use accepts or rejects each of these pieces as a whole. Thus, the whole sentence is accepted by the filter as soon as any part of it touches the rectangle.

What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.
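To make the effect concrete, here is a minimal, self-contained sketch that does not use iText at all. The Chunk class is a hypothetical stand-in for TextRenderInfo, the glyphs are assumed to be equally wide, and java.awt.geom.Rectangle2D plays the role of the filter rectangle. It only illustrates why filtering whole pieces returns the full sentence while filtering single characters returns the highlighted word.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ChunkFilterDemo {

    /** Hypothetical stand-in for TextRenderInfo: text on a horizontal baseline. */
    static final class Chunk {
        final String text;
        final float xStart, xEnd, y;

        Chunk(String text, float xStart, float xEnd, float y) {
            this.text = text;
            this.xStart = xStart;
            this.xEnd = xEnd;
            this.y = y;
        }

        /** Splits into one-character chunks, assuming all glyphs are equally wide. */
        List<Chunk> split() {
            List<Chunk> parts = new ArrayList<>();
            float width = (xEnd - xStart) / text.length();
            for (int i = 0; i < text.length(); i++) {
                float x = xStart + i * width;
                parts.add(new Chunk(text.substring(i, i + 1), x, x + width, y));
            }
            return parts;
        }
    }

    /** Concatenates the text of every chunk whose baseline intersects the region. */
    static String extract(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.intersectsLine(c.xStart, c.y, c.xEnd, c.y)) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 38 characters, 10 units each, so "eliminate" spans x = 130..220.
        Chunk sentence = new Chunk("was going to eliminate the paper chase", 0, 380, 100);
        // Region slightly inside the word "eliminate": x = 131..219, y = 90..110.
        Rectangle2D region = new Rectangle2D.Float(131, 90, 88, 20);

        // One piece for the whole TJ: the piece touches the region, so all of it passes.
        System.out.println(extract(Collections.singletonList(sentence), region));
        // prints "was going to eliminate the paper chase"

        // One piece per character: only the characters touching the region pass.
        System.out.println(extract(sentence.split(), region));
        // prints "eliminate"
    }
}
```

In a real PDF the glyph widths differ and the highlight rectangle rarely lines up this neatly, so the character-level result is usually less clean than in this sketch.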

This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:

/**
 * Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
 * @return A list of {@link TextRenderInfo} objects that represent each glyph used in the draw operation. The net effect is as if there was a separate Tj operation for each character in the rendered string
 * @since 5.3.3
 */
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net

Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all the calls it gets to another listener (your filtered listener in your case), with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the resulting single-character pieces one by one.

A Java sample

As the OP asked for more details, here is some more code. As I'm predominantly working with Java, I'm providing it in Java for iText, but it is easy to port to C# for iTextSharp.

As mentioned above a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.

For this step you can use this class TextRenderInfoSplitter:

package stackoverflow.itext.extraction;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class TextRenderInfoSplitter implements TextExtractionStrategy
{
    public TextRenderInfoSplitter(TextExtractionStrategy strategy)
    {
        this.strategy = strategy;
    }

    public void renderText(TextRenderInfo renderInfo)
    {
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
        {
            strategy.renderText(info);
        }
    }

    public void beginTextBlock()
    {
        strategy.beginTextBlock();
    }

    public void endTextBlock()
    {
        strategy.endTextBlock();
    }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        strategy.renderImage(renderInfo);
    }

    public String getResultantText()
    {
        return strategy.getResultantText();
    }

    final TextExtractionStrategy strategy;
}

If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you now can feed it with single-character TextRenderInfo instances like this:

String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));

I tested it with the PDF created in this answer for the area

Rectangle rect = new Rectangle(200, 600, 200, 135);

For reference I marked the area in the PDF:

Text extraction filtered by area without the TextRenderInfoSplitter results in:

I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox

Text extraction filtered by area with the TextRenderInfoSplitter results in:

 to create a PDF f
ntents in the docu
n g P D F

Incidentally, here you can see a disadvantage of splitting the text into individual characters early: the final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies can still easily see that the line consists of the two words using and PDFBox. As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
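The following sketch models that effect in isolation. It is a simplified illustration, not iText's actual code: the rule "insert a space when the gap between two consecutive pieces exceeds half a space width" is an assumption made for this example, and all names and numbers are hypothetical.

```java
class SpacingDemo {

    /**
     * Joins pieces from left to right and inserts a space whenever the gap
     * between two consecutive pieces exceeds half a space width
     * (an assumed rule for this illustration).
     */
    static String join(String[] texts, float[] starts, float[] ends, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0 && starts[i] - ends[i - 1] > spaceWidth / 2) {
                sb.append(' ');
            }
            sb.append(texts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float spaceWidth = 3f;

        // The word as one piece: its internal character spacing is invisible to the strategy.
        System.out.println(join(new String[] {"using"},
                new float[] {0f}, new float[] {70f}, spaceWidth));
        // prints "using"

        // The same word split into characters: glyphs are 6 units wide and
        // 10 units apart, so every gap looks like a word boundary.
        System.out.println(join(new String[] {"u", "s", "i", "n", "g"},
                new float[] {0f, 16f, 32f, 48f, 64f},
                new float[] {6f, 22f, 38f, 54f, 70f}, spaceWidth));
        // prints "u s i n g"
    }
}
```

If this matters for your documents, one option is to split only those pieces that are not entirely inside or entirely outside the rectangle, and to forward the rest unchanged.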

An improvement

The highlighted word "eliminate" is for instance extracted as "o eliminate t". This has been highlighted by double clicking the word and highlighted in Adobe Acrobat Reader.

Something similar happens in my sample above: letters barely touching the area of interest make it into the result.

This is due to the RegionTextRenderFilter implementation of allowText, which accepts all text whose baseline intersects the rectangle in question, even if the intersection consists of merely a single point:

public boolean allowText(TextRenderInfo renderInfo){
    LineSegment segment = renderInfo.getBaseline();
    Vector startPoint = segment.getStartPoint();
    Vector endPoint = segment.getEndPoint();

    float x1 = startPoint.get(Vector.I1);
    float y1 = startPoint.get(Vector.I2);
    float x2 = endPoint.get(Vector.I1);
    float y2 = endPoint.get(Vector.I2);

    return filterRect.intersectsLine(x1, y1, x2, y2);
}

Given that you first split the text into characters, you might want to check whether their respective baselines are completely contained in the area in question, i.e. implement your own RenderFilter by copying RegionTextRenderFilter and then replacing the line

return filterRect.intersectsLine(x1, y1, x2, y2);

by

return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);

Depending on how exactly text is highlighted in Adobe Acrobat Reader, though, you might want to change this in a completely custom way.
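As a rough, library-free illustration of the difference between the two checks, the sketch below applies both to single characters. It uses java.awt.geom.Rectangle2D as a stand-in for the filter rectangle and assumes equally wide glyphs on a single baseline; the class and method names are hypothetical.

```java
import java.awt.geom.Rectangle2D;

class StrictRegionDemo {
    static final float GLYPH_WIDTH = 10f; // assumption: all glyphs are equally wide
    static final float BASELINE_Y = 100f; // assumption: a single horizontal baseline

    /** Lenient check: the baseline merely has to touch the region. */
    static boolean touches(Rectangle2D region, float x1, float x2) {
        return region.intersectsLine(x1, BASELINE_Y, x2, BASELINE_Y);
    }

    /** Strict check: both end points of the baseline must lie inside the region. */
    static boolean inside(Rectangle2D region, float x1, float x2) {
        return region.contains(x1, BASELINE_Y) && region.contains(x2, BASELINE_Y);
    }

    /** Filters the text character by character with the chosen check. */
    static String extract(String text, Rectangle2D region, boolean strict) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            float x1 = i * GLYPH_WIDTH;
            float x2 = x1 + GLYPH_WIDTH;
            boolean keep = strict ? inside(region, x1, x2) : touches(region, x1, x2);
            if (keep) {
                sb.append(text.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "was going to eliminate the paper chase";
        // Region x = 119..231: it slightly overlaps the neighbouring "o" and "t".
        Rectangle2D region = new Rectangle2D.Float(119, 90, 112, 20);

        System.out.println("[" + extract(text, region, false) + "]");
        // prints "[o eliminate t]"
        System.out.println("[" + extract(text, region, true) + "]");
        // prints "[ eliminate ]"
    }
}
```

The strict variant still keeps the surrounding spaces in this example, so trimming the result is usually a good idea. Note also that Rectangle2D.contains treats points lying exactly on the right or top edge as outside, which may matter if the highlight rectangle fits the glyphs tightly.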
