Edit an existing PDF file using iTextSharp

Question

I have a PDF file that I process by converting it to text with the following code:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));



If, during processing, I find any kind of ambiguity in the content (that is, an error in the data of the PDF file), I have to mark the entire line in the PDF file (color that line red), but I cannot work out how to achieve that. Please help me.

Recommended answer

As already mentioned in the comments: what you essentially need is a replacement for SimpleTextExtractionStrategy that returns not just the text but the text together with its positions. The LocationTextExtractionStrategy is a good starting point for that, as it already collects the text with positions (it needs them to put the text in the right order).

If you look into the source of LocationTextExtractionStrategy, you'll see that it keeps its text pieces in a member List<TextChunk> locationalResult. A TextChunk (an inner class of LocationTextExtractionStrategy) represents a text piece (originally drawn by a single text drawing operation) with location information. In GetResultantText this list is sorted (top to bottom, left to right, all relative to the text baseline) and reduced to a string.

What you need is something like this LocationTextExtractionStrategy, with the difference that you retrieve the (sorted) text pieces including their positions.

Unfortunately the locationalResult member is private. If it were at least protected, you could simply derive your new strategy from LocationTextExtractionStrategy. Instead you have to copy its source and add to it (or do some introspection/reflection magic).
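Whether you copy the source or start from scratch, the core is the same: an ITextExtractionStrategy whose RenderText records each text piece together with its baseline. A minimal from-scratch sketch, assuming iTextSharp 5.x; PositionedChunk and ChunkCollectingStrategy are hypothetical names for illustration, not library classes:

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical holder for one text piece and the end points of its baseline.
public class PositionedChunk
{
    public string Text;
    public Vector Start; // start point of the baseline
    public Vector End;   // end point of the baseline
}

// Hypothetical strategy that keeps the chunks accessible instead of hiding them.
public class ChunkCollectingStrategy : ITextExtractionStrategy
{
    public readonly List<PositionedChunk> Chunks = new List<PositionedChunk>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    // Called once per text drawing operation while the page content is parsed.
    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        Chunks.Add(new PositionedChunk
        {
            Text = renderInfo.GetText(),
            Start = baseline.GetStartPoint(),
            End = baseline.GetEndPoint()
        });
    }

    // Simplification: concatenates in drawing order, without sorting or adding spaces.
    public string GetResultantText()
    {
        var sb = new StringBuilder();
        foreach (PositionedChunk chunk in Chunks) sb.Append(chunk.Text);
        return sb.ToString();
    }
}
```

You would pass an instance to PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy) exactly as in the question and then read strategy.Chunks; the coordinates are available as chunk.Start[Vector.I1] (x) and chunk.Start[Vector.I2] (y).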

Your addition would be a new method similar to GetResultantText. This method might recognize all the text on the same line (just like GetResultantText does) and either

  • do the analysis / search for ambiguities itself and return a list of the locations (start and end) of any ambiguities found; or

  • put the text found for the current line into a single TextChunk instance together with the effective start and end locations of that line, and eventually return a List<TextChunk> in which each entry represents one text line; if you do this, the calling code does the analysis to find ambiguities, and if it finds one, it has the start and end location of the line the ambiguity is on. Beware: TextChunk in the original strategy is protected, so you need to make it public for this approach to work.
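The second option can be sketched independently of the library. The following is a simplified illustration, not the code of GetResultantText: Chunk, TextLine and LineGrouper are hypothetical types, the text is assumed to be horizontal and unrotated, the tolerance of one unit is an assumed threshold, and no spaces are inserted between chunks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, library-independent stand-in for a text chunk and its baseline.
public class Chunk
{
    public string Text;
    public float StartX, EndX, BaselineY;
}

// One text line with the effective start and end of its baseline.
public class TextLine
{
    public string Text;
    public float StartX, EndX, BaselineY;
}

public static class LineGrouper
{
    // Groups chunks whose baselines lie within `tolerance` of the first chunk of a line.
    public static List<TextLine> GroupIntoLines(IEnumerable<Chunk> chunks, float tolerance = 1f)
    {
        var groups = new List<List<Chunk>>();
        // Top to bottom means descending y in default PDF user space.
        foreach (Chunk chunk in chunks.OrderByDescending(ch => ch.BaselineY))
        {
            List<Chunk> current = groups.Count > 0 ? groups[groups.Count - 1] : null;
            if (current != null && Math.Abs(current[0].BaselineY - chunk.BaselineY) <= tolerance)
                current.Add(chunk);
            else
                groups.Add(new List<Chunk> { chunk });
        }

        var lines = new List<TextLine>();
        foreach (List<Chunk> group in groups)
        {
            // Left to right within the line.
            List<Chunk> ordered = group.OrderBy(ch => ch.StartX).ToList();
            lines.Add(new TextLine
            {
                Text = string.Concat(ordered.Select(ch => ch.Text)),
                StartX = ordered[0].StartX,
                EndX = ordered.Max(ch => ch.EndX),
                BaselineY = ordered[0].BaselineY
            });
        }
        return lines;
    }
}
```

The calling code can then run its ambiguity check on each TextLine.Text and, on a hit, use StartX, EndX and BaselineY of that line as the position data for marking.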

Either way, you eventually have the start and end location of the ambiguities, or at least of the lines the ambiguities are on. Now you have to highlight the line in question (as you say, you have to mark the entire line of the PDF and color it red).

To manipulate a given PDF you use a PdfStamper. You can mark a line on a page by

  • getting the UnderContent for that page from the PdfStamper and filling a rectangle in red there using your position data; the disadvantage of this approach is that if the original PDF already underlays the line with filled areas, your mark will be hidden beneath them; or

  • getting the OverContent for that page from the PdfStamper and filling a somewhat transparent rectangle in red; or

  • adding a highlight annotation to the page.
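As an illustration of the OverContent variant, here is a minimal sketch assuming iTextSharp 5.x. LineMarker and MarkLine are hypothetical names, the opacity of 0.3 is an arbitrary choice, and the rectangle is assumed to have been computed already from your position data:

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class LineMarker
{
    // Copies sourcePath to targetPath and fills `area` on `pageNumber`
    // with a semi-transparent red rectangle drawn over the page content.
    public static void MarkLine(string sourcePath, string targetPath, int pageNumber, Rectangle area)
    {
        PdfReader reader = new PdfReader(sourcePath);
        using (FileStream output = new FileStream(targetPath, FileMode.Create))
        {
            PdfStamper stamper = new PdfStamper(reader, output);
            PdfContentByte canvas = stamper.GetOverContent(pageNumber);

            canvas.SaveState();
            PdfGState state = new PdfGState();
            state.FillOpacity = 0.3f; // assumed value; keeps the text readable
            canvas.SetGState(state);
            canvas.SetColorFill(BaseColor.RED);
            canvas.Rectangle(area.Left, area.Bottom, area.Width, area.Height);
            canvas.Fill();
            canvas.RestoreState();

            stamper.Close(); // writes the modified PDF
        }
        reader.Close();
    }
}
```

For the UnderContent variant you would call stamper.GetUnderContent(pageNumber) instead and could drop the PdfGState, since the rectangle is then painted beneath the existing content.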

To make things even smoother, you might want to extend your copy of TextChunk (the inner class in your copy of LocationTextExtractionStrategy) to keep not only the baseline coordinates but also the maximal ascent and descent of the glyphs used. Obviously you would have to fill in that information in RenderText.

That way you know exactly the height required for your marking rectangle.
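One way to obtain these values, sketched under the assumption of iTextSharp 5.x and unrotated horizontal text, is to read the ascent and descent lines that TextRenderInfo exposes next to the baseline. ChunkBounds, FromRenderInfo and Union are hypothetical helper names:

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

public static class ChunkBounds
{
    // Bounding box of one text piece, spanned by the start of its
    // descent line (lower left) and the end of its ascent line (upper right).
    public static Rectangle FromRenderInfo(TextRenderInfo renderInfo)
    {
        Vector lowerLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector upperRight = renderInfo.GetAscentLine().GetEndPoint();
        return new Rectangle(lowerLeft[Vector.I1], lowerLeft[Vector.I2],
                             upperRight[Vector.I1], upperRight[Vector.I2]);
    }

    // Smallest rectangle containing both boxes; applying it to all chunks of
    // a line yields the maximal ascent and descent for that line.
    public static Rectangle Union(Rectangle a, Rectangle b)
    {
        return new Rectangle(Math.Min(a.Left, b.Left), Math.Min(a.Bottom, b.Bottom),
                             Math.Max(a.Right, b.Right), Math.Max(a.Top, b.Top));
    }
}
```

Calling FromRenderInfo inside RenderText and merging the boxes of all chunks on a line with Union gives a rectangle that can be used directly as the marking area.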
