Extract images from PDF, how to handle JBIG2-encoded images

Question

I have a bunch of PDF files. Some of them are pure text, but others are fully or partially saved as "one image per page" because they were generated by a scanner.

I need to extract all the images contained in each PDF and then examine each image separately.

I was able to extract most of the images with a Python script found here on SO; see the question "Extract images from PDF without resampling, in python?".

Some of the included images were encoded using JBIG2, and I could not find any Python library or other tool to convert JBIG2 into something that can be easily opened with a generic graphics tool.

Recommended answer

I struggled with this for many weeks. Many answers on SO helped me along, but something was always missing; apparently no one here has run into problems with JBIG2-encoded images.

In the batch of PDFs I have to scan, JBIG2-encoded images are very common.

As far as I understand, many copy/scan machines scan paper and turn it into PDF files full of JBIG2-encoded images.

So, after many days of testing, I decided to go with the answer proposed here by dkagedal a long time ago.

Here is my step-by-step guide on Linux (if you have another OS, I suggest using a Linux Docker container; it is going to be much easier):

First step:

apt-get install poppler-utils

Then I was able to run a command-line tool called pdfimages like this:

pdfimages -all myfile.pdf ./images_found/

With the above command you extract all the images contained in myfile.pdf and save them inside images_found (you have to create images_found beforehand).
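Because the question started from a Python script, the extraction step can also be driven from Python. The following is only a minimal sketch, assuming pdfimages from poppler-utils is on the PATH; the function names build_pdfimages_command and extract_images are hypothetical helpers invented for this illustration, not part of any library.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_dir):
    """Return the argument list for `pdfimages -all <pdf> <out_dir>/`."""
    # The trailing slash makes pdfimages use the directory as the image
    # root, so the extracted files get names such as "-145.jb2e".
    prefix = str(Path(out_dir)) + "/"
    return ["pdfimages", "-all", str(pdf_path), prefix]


def extract_images(pdf_path, out_dir="images_found"):
    """Create out_dir if needed, run pdfimages, and list what it wrote."""
    # Hypothetical helper: creates the output directory first, because
    # it has to exist before pdfimages runs.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdfimages_command(pdf_path, out_dir), check=True)
    return sorted(Path(out_dir).iterdir())
```

Calling extract_images("myfile.pdf") would then reproduce the command above and return the paths of the extracted files.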

Among the extracted files you may find several image types (depending on your PDF), such as png, jpg, and tiff; all of these are easily readable with any graphics tool.
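To see at a glance which formats a given PDF produced, the extracted files can be tallied by extension. This is a small sketch; count_by_extension is a hypothetical helper name, and the file names in the usage note are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(paths):
    """Count files per lower-cased extension, e.g. {".png": 2, ".jb2e": 1}."""
    return dict(Counter(Path(p).suffix.lower() for p in paths))
```

For example, count_by_extension(["images_found/-000.png", "images_found/-145.jb2e"]) reports one .png and one .jb2e file, which tells you that at least one image still needs JBIG2 decoding.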

You will also have some files with names like -145.jb2e and -145.jb2g.

These 2 files contain ONE IMAGE encoded in JBIG2, saved in 2 different files: one for the header and one for the data (the .jb2g file holds the JBIG2 global segments and the .jb2e file holds the embedded image stream).
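Since one image can be split across two files, it helps to pair them by base name before converting anything. A minimal sketch follows; pair_jbig2_files is a hypothetical helper, and it assumes the two halves of an image differ only in their extension, as in the -145 example above.

```python
from pathlib import Path


def pair_jbig2_files(paths):
    """Group .jb2g/.jb2e files that share a directory and base name.

    Returns a list of (globals_path_or_None, embedded_path) tuples,
    sorted by base name. Files with other extensions are ignored.
    """
    globals_by_key = {}
    embedded_by_key = {}
    for p in map(Path, paths):
        key = p.with_suffix("")  # directory + base name, extension dropped
        if p.suffix == ".jb2g":
            globals_by_key[key] = p
        elif p.suffix == ".jb2e":
            embedded_by_key[key] = p
    return [(globals_by_key.get(k), embedded_by_key[k])
            for k in sorted(embedded_by_key)]
```

An embedded stream without a matching .jb2g file is returned with None in the first position, so the caller can decide how to handle it.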

Again I lost many days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.

So first you need to install this magic tool:

apt-get install jbig2dec

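Then you can run the following from inside images_found (the ./ prefix keeps jbig2dec from reading the leading dash in the file names as an option):

jbig2dec -t png ./-145.jb2g ./-145.jb2e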

You will finally get all the extracted images converted into something useful.
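With many scanned PDFs, running the conversion by hand for every pair gets tedious. The sketch below loops over a directory and calls jbig2dec once per pair. It assumes jbig2dec is installed and that each .jb2e file has a .jb2g file with the same base name; build_jbig2dec_command and convert_directory are hypothetical helper names. It passes an explicit output file with -o and prefixes bare file names that start with a dash with ./ so they are not parsed as options.

```python
import subprocess
from pathlib import Path


def _dash_safe(path):
    """Prefix './' when a path starts with '-' so it is not read as an option."""
    text = str(path)
    return "./" + text if text.startswith("-") else text


def build_jbig2dec_command(globals_path, embedded_path, output_path=None):
    """Return the jbig2dec argument list: globals file first, then embedded."""
    if output_path is None:
        # Default: write the PNG next to the embedded stream.
        output_path = Path(embedded_path).with_suffix(".png")
    return ["jbig2dec", "-t", "png", "-o", _dash_safe(output_path),
            _dash_safe(globals_path), _dash_safe(embedded_path)]


def convert_directory(directory="images_found"):
    """Convert every .jb2e/.jb2g pair in directory; return the PNG paths."""
    converted = []
    for embedded in sorted(Path(directory).glob("*.jb2e")):
        globals_file = embedded.with_suffix(".jb2g")
        if not globals_file.exists():
            continue  # this sketch only handles images with a globals file
        subprocess.run(build_jbig2dec_command(globals_file, embedded), check=True)
        converted.append(embedded.with_suffix(".png"))
    return converted
```

Calling convert_directory("images_found") after the extraction step would leave a .png next to each JBIG2 pair.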

Good luck!
