Difference between iTextSharp 4.1.6 and 5.x versions


Question

We are developing a PDF parser to be used alongside our system. The requirement is that we store all the information from any PDF document and are able to reproduce the document as it was (with minimal changes from the original).

We did some googling and found iTextSharp to be the best fit for our purpose. We are developing our project in .NET.

As you might have guessed from the title, we need a comparison of specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp released under the LGPL/MPL license. The 5.x versions are AGPL.

We would like a good comparison between the versions before choosing between the LGPL version and buying a license for the AGPL version (we don't want to publish our code).

I browsed through the revision changes in iTextSharp, but I would like to know whether anything exists that gives a good comparison between the versions.

Thanks in advance!

Recommended answer

I'm the CTO of iText Software, so just like Michaël, who already answered in the comment section, I'm at the same time the most authoritative source and a biased one.

There's a very simple comparison chart on the iText web site: http://itextpdf.com/functionalitycomparison

This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.

You've probably also found this page: http://itextpdf.com/salesfaq

In case you're wondering about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:


  • 5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
  • 5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
  • 5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
  • 5.0.1: New filtering functionality for text renderers.
  • 5.0.1: Additional utility method for previewing pdf content.
  • 5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
  • 5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
  • 5.0.1: Added rudimentary support for XObject Image callbacks
  • 5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
  • 5.0.1: Bug fix - matrices were being concatenated in the wrong order.
  • 5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
  • 5.0.1: Getters for GraphicsState
  • 5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
  • 5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
  • 5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
  • 5.0.2: PdfContentReaderTool: Show details on resource entries
  • 5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
  • 5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
  • 5.0.2: Adjustments to linesegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
  • 5.0.3: added method to get area of image in user units
  • 5.0.3: better parsing of inline images
  • 5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
  • 5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
  • 5.0.4: Expose CTM
  • 5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
  • 5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
  • 5.0.4: Applying stream filters to inline images.
  • 5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
  • 5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
  • 5.0.6: handle slightly malformed embedded images
  • 5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
  • 5.0.6: performance: Cache the fonts used in text extraction
  • 5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
  • 5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
  • 5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
  • 5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
  • 5.1.3: images: allow correct decoding of 1bpc bitmask images
  • 5.1.3: images: add jbig2 streams to pass through
  • 5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
  • 5.2.0: Better error messages and better handling of zero-sized files and of attempts to read past the end of the file.
  • 5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
  • 5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
  • 5.2.0: Made a utility method in PdfContentStreamProcessor private and clarified the stateful nature of the class
  • 5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
  • 5.2.0: Better handling of color space dictionaries in images.
  • 5.2.0: improve handling of quasi improper inline image content.
  • 5.2.0: don't decode inline image streams until we absolutely need them.
  • 5.2.0: avoid NullPointerException if resource dictionary isn't provided.
  • 5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
  • 5.3.3: incorporate the text-rise parameter
  • 5.3.3: expose glyph-by-glyph information
  • 5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
  • 5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
  • 5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
  • 5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
  • 5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
  • 5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
  • 5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
  • 5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
  • 5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
  • 5.4.5: Added MultiFilteredRenderListener class for PDF parser.
  • 5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
  • 5.4.5: Added method getMcid() in TextRenderInfo.
  • 5.4.5: fixed resource leak when many inline images were in content stream
  • 5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
  • 5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
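
To make the location-based items in the list more concrete, here is a minimal sketch in Java (the language of the original iText code base, which iTextSharp ports) of how positioned text chunks can be reassembled into readable text. It is not the iText implementation: the `Chunk` and `LocationSketch` classes, their fields, the exact-equality test for "same line", and the half-a-space threshold are all assumptions made for illustration. The idea is only that chunks are ordered by their position in user space and that a space is inserted when the horizontal gap between neighbours is large enough.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned run of text; not an iText class.
final class Chunk {
    final String text;
    final float startX;     // left end of the baseline, in user-space units
    final float endX;       // right end of the baseline
    final float baselineY;  // vertical position of the baseline
    final float spaceWidth; // width of a space character in the chunk's font

    Chunk(String text, float startX, float endX, float baselineY, float spaceWidth) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
        this.baselineY = baselineY;
        this.spaceWidth = spaceWidth;
    }
}

final class LocationSketch {
    // Reading order for unrotated text: higher baselines first, then left to right.
    static final Comparator<Chunk> READING_ORDER =
            Comparator.comparingDouble((Chunk c) -> -c.baselineY)
                      .thenComparingDouble(c -> c.startX);

    // Assumed heuristic: a gap wider than half a space separates two words.
    static boolean isWordBoundary(Chunk previous, Chunk current) {
        float gap = current.startX - previous.endX;
        return gap > current.spaceWidth / 2f;
    }

    // Sorts the chunks by position and joins them into lines and words.
    static String join(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(READING_ORDER);
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : sorted) {
            if (previous != null) {
                if (current.baselineY != previous.baselineY) {
                    out.append('\n'); // different baseline: start a new line
                } else if (isWordBoundary(previous, current)) {
                    out.append(' ');  // same line, visible gap: new word
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }
}
```

Because the chunks are sorted by position rather than by their order in the content stream, text that was drawn out of order still comes out in reading order. A real extractor also has to cope with rotated text, nearly equal baselines, and overlapping chunks, which is what many of the fixes listed above address.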

There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides: http://www.slideshare.net/iTextPDF/itext-summit-2014-talk-unstructured-pdf

If you look at the roadmap for iText, you'll see that we'll invest even more time in text extraction in the future: http://www.slideshare.net/iTextPDF/itext-summit-2014-keynote-talk

In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel, it would also be like falling into every pitfall we've fallen into in the last 5 years. I can assure you that buying a license will be less expensive.
