Performance: iText vs. PDFBox

Question

I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text, and I tried both iText and Apache PDFBox. I see a really big difference in performance: with iText it took 2:521, and with PDFBox 6:117. This is my code for PDFBox:

PDFTextStripper stripper = new PDFTextStripper();
BUFFER.append(stripper.getText(PDDocument.load(pdf)));

And this is the code for iText:

PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
  BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
}

My question is what the performance depends on. Is there a way to make PDFBox faster, or should I just use iText? And can you explain more about how strategies affect performance?

Recommended answer


My question is what the performance depends on. Is there a way to make PDFBox faster?

One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation); that reduces the resources iText requires quite a lot. Furthermore, the event-oriented architecture of iText's text parsing puts a lower burden on resources than PDFBox's architecture does. And PDFBox keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.
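
To make the glyph-versus-chunk point concrete, here is a minimal toy model in plain Java. It is not the actual code of either library: the `TextEvent` class, the fixed glyph width, and the method names are assumptions made up for illustration. It only shows how many intermediate objects each approach creates for the same string operands.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not either library's real code) of per-glyph vs. per-chunk text events. */
final class GlyphVsChunkSketch {

    /** Hypothetical stand-in for the position object a parser creates per piece of text. */
    static final class TextEvent {
        final String text;
        final double x;

        TextEvent(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    /** Per-glyph handling: one event object for every character of every string operand. */
    static List<TextEvent> perGlyph(List<String> operands, double glyphWidth) {
        List<TextEvent> events = new ArrayList<>();
        double x = 0;
        for (String operand : operands) {
            for (int i = 0; i < operand.length(); i++) {
                events.add(new TextEvent(String.valueOf(operand.charAt(i)), x));
                x += glyphWidth;
            }
        }
        return events;
    }

    /** Per-chunk handling: one event object for each whole string operand. */
    static List<TextEvent> perChunk(List<String> operands, double glyphWidth) {
        List<TextEvent> events = new ArrayList<>();
        double x = 0;
        for (String operand : operands) {
            events.add(new TextEvent(operand, x));
            x += glyphWidth * operand.length();
        }
        return events;
    }

    public static void main(String[] args) {
        // Three string operands, as they might appear in text drawing operations.
        List<String> operands = Arrays.asList("Effective", " ", "Java");
        System.out.println("per glyph: " + perGlyph(operands, 6.0).size() + " events");
        System.out.println("per chunk: " + perChunk(operands, 6.0).size() + " events");
    }
}
```

Over a whole book, the per-glyph variant allocates roughly one object per character, which is the kind of overhead the answer attributes to glyph-by-glyph processing.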

But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting PDFs). All these variants may have different runtime behavior.
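
For such experiments, a small timing helper along the following lines may be useful. It is only a sketch in plain Java: the class name `LoadTimingSketch`, the method `bestMillis`, and the placeholder workload are assumptions for illustration. In a real comparison, each loading variant (for example, a lambda that loads the PDF and calls `stripper.getText(...)` on it) would be passed in as the task.

```java
import java.util.concurrent.Callable;

/** Minimal timing helper for comparing variants; a sketch, not a rigorous benchmark. */
final class LoadTimingSketch {

    /**
     * Runs the task `warmups` times without timing (so JIT compilation and class
     * loading distort the result less), then returns the fastest of `runs` timed
     * runs in milliseconds.
     */
    static long bestMillis(Callable<?> task, int warmups, int runs) throws Exception {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.call();
        }
        long bestNanos = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            bestNanos = Math.min(bestNanos, System.nanoTime() - start);
        }
        return bestNanos / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload; replace it with one document-loading variant per measurement.
        Callable<String> placeholder = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            return sb.toString();
        };
        System.out.println("best of 5: " + bestMillis(placeholder, 2, 5) + " ms");
    }
}
```

A single cold run, as in the question, also measures class loading and JIT warm-up, so repeated runs tend to give a fairer comparison between the libraries.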


Can you explain more about how strategies affect performance?

iText comes with a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts the text. By default, the more advanced one is used. Thus, you can probably speed up iText even further by using the simple strategy. PDFBox always sorts.
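
The difference between the two kinds of strategy can be sketched with a toy model in plain Java. The `Chunk` class, its coordinates, and the method names are assumptions for illustration and not iText's implementation; real strategies also insert spaces and line breaks. In iText 5, the simple strategy would presumably be selected by passing a new `SimpleTextExtractionStrategy` for each page to the three-argument `PdfTextExtractor.getTextFromPage` overload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of "simple" (stream order) vs. "sorting" (by position) text extraction. */
final class ExtractionOrderSketch {

    /** A text chunk with a hypothetical position; y grows upward, as in PDF user space. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple approach: concatenate chunks in content-stream order, with no sorting. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorting approach: order chunks top to bottom, then left to right, then concatenate. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // The content stream draws "World" before "Hello ", although "Hello " sits further left.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("World", 50, 700),
                new Chunk("Hello ", 10, 700),
                new Chunk("Next line", 10, 680));
        System.out.println(inStreamOrder(chunks));
        System.out.println(sortedByPosition(chunks));
    }
}
```

The sorting variant has to collect every chunk of a page and sort it before it can emit text, which is the extra cost the simple strategy avoids. The simple strategy only gives correct output when the PDF already stores its text in reading order.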
