PDF Reading highlighted text (highlight annotations) using C#


Problem description

I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For the highlight annotation, I only get a rectangle for the area on the page which is highlighted.

My goal is to extract the text that has been highlighted. For that I use `PdfTextExtractor`:

Rectangle rect = new Rectangle(
    pdfArray.GetAsNumber(0).FloatValue, 
    pdfArray.GetAsNumber(1).FloatValue,
    pdfArray.GetAsNumber(2).FloatValue,
    pdfArray.GetAsNumber(3).FloatValue);

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;

The result returned by PdfTextExtractor is not entirely correct. For instance, it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.

Interestingly enough, the entire text of the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).

I would love to hear any input regarding this issue, including solutions that don't involve iTextSharp.

Solution

The cause

Interestingly enough, the entire text of the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).

This is actually the cause of your issue. The iText parser classes forward text to the render listeners in the pieces in which they find it in the content stream, i.e. as continuous strings. The filter mechanism you use filters these pieces as a whole. Because the baseline of that piece intersects your rectangle, the whole sentence is accepted by the filter.

What you need, therefore, is a pre-processing step that splits these pieces into their individual characters and forwards them one at a time to your filtered render listener.
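
To make the effect concrete, here is a minimal sketch in plain Java. It is a toy model, not iText code: the `Piece` class, the fixed glyph width of 10 units, and the one-dimensional region are all assumptions made for illustration. It only shows why filtering whole pieces returns the full sentence while filtering after splitting returns the highlighted word.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only; none of these types belong to iText.
class ChunkSplitDemo {

    /** A run of text on one baseline; every glyph is assumed to be glyphWidth wide. */
    static final class Piece {
        final String text;
        final double startX;
        final double glyphWidth;

        Piece(String text, double startX, double glyphWidth) {
            this.text = text;
            this.startX = startX;
            this.glyphWidth = glyphWidth;
        }

        double endX() {
            return startX + text.length() * glyphWidth;
        }

        /** Splits this piece into one-character pieces, each with its own start position. */
        List<Piece> split() {
            List<Piece> result = new ArrayList<>();
            for (int i = 0; i < text.length(); i++) {
                result.add(new Piece(text.substring(i, i + 1), startX + i * glyphWidth, glyphWidth));
            }
            return result;
        }
    }

    /** Accepts a piece as soon as its horizontal extent overlaps [minX, maxX]. */
    static boolean accepts(Piece piece, double minX, double maxX) {
        return piece.endX() >= minX && piece.startX <= maxX;
    }

    /** Filters the pieces exactly as they arrive, without splitting them. */
    static String filterWhole(List<Piece> pieces, double minX, double maxX) {
        StringBuilder sb = new StringBuilder();
        for (Piece piece : pieces) {
            if (accepts(piece, minX, maxX)) {
                sb.append(piece.text);
            }
        }
        return sb.toString();
    }

    /** Splits every piece into characters first, then filters each character. */
    static String filterSplit(List<Piece> pieces, double minX, double maxX) {
        List<Piece> characters = new ArrayList<>();
        for (Piece piece : pieces) {
            characters.addAll(piece.split());
        }
        return filterWhole(characters, minX, maxX);
    }

    public static void main(String[] args) {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece("was going to eliminate the paper chase", 0, 10));
        // In this model "eliminate" occupies x = 130..220; the region sits just inside it.
        System.out.println(filterWhole(pieces, 131, 219)); // was going to eliminate the paper chase
        System.out.println(filterSplit(pieces, 131, 219)); // eliminate
    }
}
```

The real parser works with two-dimensional baselines and actual glyph metrics, so this only mirrors the principle.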

This is actually fairly easy to implement. TextRenderInfo, the argument type in which the text pieces are forwarded, offers a method to split itself up:

/**
 * Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
 * @return A list of {@link TextRenderInfo} objects that represent each glyph used in the draw operation. The net effect is as if there was a separate Tj operation for each character in the rendered string
 * @since 5.3.3
 */
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net

Thus, all you have to do is create and use a RenderListener / IRenderListener implementation that forwards every call it receives to another listener (your filtered listener in this case), with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the resulting pieces one by one.

A Java sample

As the OP asked for more details, here is some more code. As I predominantly work with Java, I'm providing it in Java for iText, but it is easy to port to C# for iTextSharp.

As mentioned above, a pre-processing step is needed that splits the text pieces into their individual characters and forwards them individually to your filtered render listener.

For this step you can use this class, TextRenderInfoSplitter:

package stackoverflow.itext.extraction;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

/**
 * Wraps another TextExtractionStrategy and feeds it text one character at a time.
 * All other calls are forwarded unchanged.
 */
public class TextRenderInfoSplitter implements TextExtractionStrategy
{
    public TextRenderInfoSplitter(TextExtractionStrategy strategy)
    {
        this.strategy = strategy;
    }

    public void renderText(TextRenderInfo renderInfo)
    {
        // Split the text piece into per-glyph infos and forward each one separately.
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
        {
            strategy.renderText(info);
        }
    }

    public void beginTextBlock()
    {
        strategy.beginTextBlock();
    }

    public void endTextBlock()
    {
        strategy.endTextBlock();
    }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        strategy.renderImage(renderInfo);
    }

    public String getResultantText()
    {
        return strategy.getResultantText();
    }

    // The wrapped strategy that receives the single-character pieces.
    final TextExtractionStrategy strategy;
}

If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you can now feed it single-character TextRenderInfo instances like this:

String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));

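I tested it with the PDF created in this answer (http://stackoverflow.com/questions/20680430/is-it-possible-to-justify-text-in-pdfbox/20681996#20681996) for the area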

Rectangle rect = new Rectangle(200, 600, 200, 135);

For reference I marked the area in the PDF:

Text extraction filtered by area without the TextRenderInfoSplitter results in:

I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox

Text extraction filtered by area with the TextRenderInfoSplitter results in:

 to create a PDF f
ntents in the docu
n g P D F

By the way, here you can see a disadvantage of splitting the text into individual characters early: the final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies can still easily see that the line consists of the two words "using" and "PDFBox". As soon as you feed the text segments into the text extraction strategies character by character, they are likely to interpret such widely spaced words as many one-letter words.
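
The following sketch illustrates that trade-off with a toy model in plain Java. It does not reproduce the algorithm of any iText strategy: the `Piece` class, the glyph width, the character spacing, and the rule "insert a space when the gap between neighbouring pieces exceeds a threshold" are all assumptions chosen to resemble how a position-based strategy might detect word boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only; none of these types belong to iText.
class SpacingDemo {

    /** A piece of text covering the horizontal range startX..endX. */
    static final class Piece {
        final String text;
        final double startX;
        final double endX;

        Piece(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /** Lays out the text with extra character spacing, one piece per glyph. */
    static List<Piece> perCharacter(String text, double glyphWidth, double charSpacing) {
        List<Piece> pieces = new ArrayList<>();
        double x = 0;
        for (int i = 0; i < text.length(); i++) {
            pieces.add(new Piece(text.substring(i, i + 1), x, x + glyphWidth));
            x += glyphWidth + charSpacing;
        }
        return pieces;
    }

    /** Same layout, but kept together as one piece, as found in the content stream. */
    static List<Piece> asOnePiece(String text, double glyphWidth, double charSpacing) {
        List<Piece> pieces = new ArrayList<>();
        double width = text.length() * glyphWidth + (text.length() - 1) * charSpacing;
        pieces.add(new Piece(text, 0, width));
        return pieces;
    }

    /** Joins pieces, inserting a space wherever the gap between neighbours exceeds the threshold. */
    static String join(List<Piece> pieces, double gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Piece previous = null;
        for (Piece piece : pieces) {
            boolean wideGap = previous != null && piece.startX - previous.endX > gapThreshold;
            boolean spaceAlreadyThere = previous != null
                    && (previous.text.endsWith(" ") || piece.text.startsWith(" "));
            if (wideGap && !spaceAlreadyThere) {
                sb.append(' ');
            }
            sb.append(piece.text);
            previous = piece;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Glyphs 5 units wide, 8 units of character spacing, gap threshold 2.5 (all hypothetical).
        System.out.println(join(asOnePiece("using PDFBox", 5, 8), 2.5));   // using PDFBox
        System.out.println(join(perCharacter("using PDFBox", 5, 8), 2.5)); // u s i n g P D F B o x
    }
}
```

Kept as one piece, the line is emitted unchanged. Once it is split, every inter-character gap looks like a word boundary, which matches the kind of output shown above.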

An improvement

The highlighted word "eliminate" is, for instance, extracted as "o eliminate t". It was highlighted in Adobe Acrobat Reader by double-clicking the word.

Something similar happens in my sample above: letters barely touching the area of interest make it into the result.

This is because the RegionTextRenderFilter implementation of allowText accepts all text whose baseline intersects the rectangle in question, even if the intersection consists of merely a single point:

public boolean allowText(TextRenderInfo renderInfo){
    LineSegment segment = renderInfo.getBaseline();
    Vector startPoint = segment.getStartPoint();
    Vector endPoint = segment.getEndPoint();

    float x1 = startPoint.get(Vector.I1);
    float y1 = startPoint.get(Vector.I2);
    float x2 = endPoint.get(Vector.I1);
    float y2 = endPoint.get(Vector.I2);

    return filterRect.intersectsLine(x1, y1, x2, y2);
}

Given that you first split the text into characters, you might want to check whether each character's baseline is completely contained in the area in question, i.e. implement your own RenderFilter by copying RegionTextRenderFilter and then replacing the line

return filterRect.intersectsLine(x1, y1, x2, y2);

by

return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);

Depending on exactly how text is highlighted in Adobe Acrobat Reader, though, you might want to change this check in a completely custom way.
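
The difference between the two checks can be tried out with the standard library alone. This is a minimal sketch, assuming the filter rectangle behaves like a java.awt.geom.Rectangle2D given as x, y, width, height. The class name `BaselineCheckDemo` and the sample baselines are made up for illustration.

```java
import java.awt.geom.Rectangle2D;

// Stand-alone comparison of the two baseline checks; no iText classes involved.
class BaselineCheckDemo {

    /** The original check: true as soon as the baseline touches the rectangle. */
    static boolean touches(Rectangle2D rect, double x1, double y1, double x2, double y2) {
        return rect.intersectsLine(x1, y1, x2, y2);
    }

    /** The stricter check: true only if both baseline end points lie inside the rectangle. */
    static boolean fullyInside(Rectangle2D rect, double x1, double y1, double x2, double y2) {
        return rect.contains(x1, y1) && rect.contains(x2, y2);
    }

    public static void main(String[] args) {
        // Covers x = 200..400 and y = 600..735.
        Rectangle2D rect = new Rectangle2D.Double(200, 600, 200, 135);

        // A glyph whose baseline starts left of the rectangle and ends just inside it.
        System.out.println(touches(rect, 190, 650, 202, 650));     // true
        System.out.println(fullyInside(rect, 190, 650, 202, 650)); // false

        // A glyph whose baseline lies completely inside the rectangle.
        System.out.println(touches(rect, 210, 650, 220, 650));     // true
        System.out.println(fullyInside(rect, 210, 650, 220, 650)); // true
    }
}
```

Note that the stricter check also drops glyphs that are mostly, but not entirely, inside the highlighted area. Whether that is desirable depends on how tightly the annotation rectangle hugs the text.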
