How to reduce the size of the PDF generated by tesseract?


Question

The setup of my (web) app is the following: I get PDF files uploaded by users, run OCR on them, and show the users the OCRed PDF. Since everything happens online, minimizing the size of the resulting PDF file is key to reducing loading and wait time for the user.

The file I receive from the user is sample.pdf (I've created an archive with the original files as well as those that I generate here: https://dl.dropboxusercontent.com/u/1390155/tess-files/sample.zip). I use tesseract 3.04 and do the following:

gs -r300 -sDEVICE=tiff24nc -dBATCH -dNOPAUSE -sOutputFile=sample.tiff sample.pdf
tesseract sample.tiff sample-tess -l fra -psm 1 pdf

The result of the OCR is good, but the generated PDF is now about 2.5 times the size of the original:

  • Size of the original PDF: 60k
  • Size of the final PDF: 147k
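
To put a number on the growth for any pair of files, a small helper along these lines could be used. This is only a sketch: `size_ratio` is a hypothetical name, not part of tesseract or ghostscript, and it relies only on the standard `wc` and `awk` utilities.

```shell
# size_ratio ORIGINAL GENERATED
# Prints how many times larger GENERATED is than ORIGINAL, to two decimals.
# (Hypothetical helper for illustration; assumes both files exist and
# ORIGINAL is not empty.)
size_ratio() {
    original_bytes=$(wc -c < "$1")
    generated_bytes=$(wc -c < "$2")
    awk -v o="$original_bytes" -v g="$generated_bytes" \
        'BEGIN { printf "%.2f\n", g / o }'
}

# Example call with the files from this question:
#   size_ratio sample.pdf sample-tess.pdf
```

Running it after each experiment makes the different variants easier to compare than reading sizes off a directory listing.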

So I ask you, how can I reduce the size of the generated PDF while keeping the OCR result?

One obvious solution is to reduce the resolution when generating the tiff, but I don’t want to do that as it may affect the OCR result.

The second thing I tried was to reduce the PDF size post-tesseract, using ghostscript:

gs -o sample-down-300.pdf   -sDEVICE=pdfwrite   -dDownsampleColorImages=true \
   -dDownsampleGrayImages=true   -dDownsampleMonoImages=true  \
   -dColorImageResolution=300   -dGrayImageResolution=300  \
   -dMonoImageResolution=300   -dColorImageDownsampleThreshold=1.0  \
   -dGrayImageDownsampleThreshold=1.5   -dMonoImageDownsampleThreshold=1.0 \
    sample-tess.pdf 

This helps a bit: the generated file is only 101K, about 1.5 times the original. I could live with that, but it also seems to affect the OCR result. For example, the white space between ‘RESTAURANT’ and ‘PIZZERIA’ (second line) is now missing.
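
A change like the missing space can be checked mechanically by extracting the text layer from both PDFs and diffing the results. The following is a minimal sketch, assuming `pdftotext` from Poppler is installed (it is not part of the setup described above); `text_layer_diff` is a hypothetical helper name.

```shell
# text_layer_diff BEFORE.pdf AFTER.pdf
# Prints a diff of the extractable text of the two PDFs and returns
# non-zero if the text layers differ.
# Assumption: pdftotext (Poppler) is available on the PATH.
text_layer_diff() {
    before_txt=$(mktemp) || return 2
    after_txt=$(mktemp) || return 2
    # "-" as the output file makes pdftotext write to standard output.
    pdftotext -layout "$1" - > "$before_txt"
    pdftotext -layout "$2" - > "$after_txt"
    diff "$before_txt" "$after_txt"
    status=$?
    rm -f "$before_txt" "$after_txt"
    return "$status"
}

# Example call with the files from this question:
#   text_layer_diff sample-tess.pdf sample-down-300.pdf
```

Note that this compares what a text extractor sees, which may differ slightly from what a given PDF viewer offers for selection and copy/paste.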

Another (simpler) option with ghostscript, using the ebook setting, results in a 43k file with somewhat lower image quality in the PDF and the same problem of missing white space:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dBATCH  -dQUIET -sOutputFile=sample-ebook.pdf \
     sample-tess.pdf

The lower quality of the PDF is fine, but again, I don’t really want to compromise on the OCR.

I’ve done other tests using PNG and JPEG, but the OCR results always get worse (even if only slightly) and the resulting PDF is not smaller. For example, with PNG:

convert -density 300 sample.pdf -transparent white sample.png
tesseract sample.png sample-tess-png -l fra -psm 1 pdf

The total (55.50) is missing and the final PDF size is 149k.

So to summarize, here are my questions:

  • Can someone explain why reducing the size of the PDF using ghostscript affects the OCR result? I thought the text layer and the image layer were independent...
  • Are there options that one can give to tesseract to reduce the quality of the images when it generates the PDF?
  • I read that other solutions like ABBYY OCR use Mixed Rasterized Content (MRC) to reduce the file size. Does tesseract do that already? If not, are there some open source or proprietary CLI tools that do that, which I could use to reduce the tesseract generated PDF file?

Again, I’m OK compromising on the quality of the PDF images (although I would like to keep the colors, ideally) as long as the user can search text and select it to copy/paste from the PDF.

Any help greatly appreciated!

Recommended answer

Since you use Tesseract 3.04, it supports various compression modes that you may want to check out.

  --force-transcode=[true|false]
  --force-lossless=[true|false]
  --force-compression-algorithms=[dct|flate|g4|lzw|jpx|jbig2]

See issues 1285 and 1300.
