How to recognize images within scanned PDF files?


Question

I am trying to identify images (as opposed to text) within scanned PDF files, ideally using Python. Is there any way to do this? As a simple example, say you've scanned a chapter of a book. Each page falls into one of three categories:

  1. Text only
  2. One or more images only
  3. Both text and images

I would like to output a list of the pages that fall into category 2 or 3.

Recommended answer

My idea would be to look for features that do not occur in normal text, such as vertical black elements spanning multiple lines. My tool of choice is ImageMagick, which is installed on most Linux distros and is available for macOS and Windows. I would just run it in the Terminal at the command prompt.

So, I would use this command. Note that, just for illustration, I placed the original page to the left of the processed page and put a red border around them:

magick page-28.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 result.png

I get this:

[Result images for page-25.png, page-26.png, page-27.png and page-28.png]

Explanation of the command above...

In the above command, rather than thresholding, I reduce the image to 2 colours, convert it to greyscale and then normalise it. That should pick black and the background colour as the two colours, and they become black and white once converted to greyscale and normalised.

I then apply a median filter with a 200-pixel-tall structuring element, which is taller than a few lines of text, so it should pick out tall features, i.e. vertical lines.
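To make the effect of a tall, narrow median filter concrete, here is a minimal pure-Python sketch on a toy black-and-white grid. It is not ImageMagick's implementation: the function name `vertical_median`, the toy page and the simple clipping of windows at the top and bottom edges are all assumptions made for illustration.

```python
from statistics import median_low

BLACK, WHITE = 0, 255

def vertical_median(image, height):
    """Median-filter each column with a window 1 pixel wide and `height` tall.

    `image` is a list of rows of BLACK/WHITE values and `height` is assumed
    to be odd. Windows are simply clipped at the top and bottom edges, which
    is an assumption of this sketch, not necessarily what ImageMagick does.
    """
    rows, cols = len(image), len(image[0])
    half = height // 2
    out = [[WHITE] * cols for _ in range(rows)]
    for x in range(cols):
        column = [image[y][x] for y in range(rows)]
        for y in range(rows):
            window = column[max(0, y - half): y + half + 1]
            out[y][x] = median_low(window)
    return out

# Toy page: 40 rows by 6 columns, all white.
page = [[WHITE] * 6 for _ in range(40)]
for y in range(5, 35):      # a 30-pixel-tall vertical line in column 0
    page[y][0] = BLACK
for y in (10, 11):          # a 2-pixel-tall "text" stroke in column 3
    page[y][3] = BLACK

filtered = vertical_median(page, 9)
tall_pixels = sum(row[0] == BLACK for row in filtered)
short_pixels = sum(row[3] == BLACK for row in filtered)
```

With a 9-pixel window, the 30-pixel line survives in full while the 2-pixel stroke disappears, which is the behaviour the 1x200 element relies on at page scale.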


Carrying on...

So, if I invert the image so that black becomes white and white becomes black, then take the mean and multiply it by the total number of pixels in the image, that tells me how many pixels are part of vertical features:

convert page-28.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 -negate -format "%[fx:mean*w*h]" info:
90224

convert page-27.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 -negate -format "%[fx:mean*w*h]" info:
0

So page 28 is not pure text and page 27 is.
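Since the question asks for Python, the check above could be wrapped roughly as follows. This is only a sketch: it assumes `convert` is on the PATH, the helper names (`build_command`, `vertical_pixel_count`, `pages_with_images`) are made up here, and the default threshold of 0 is an assumption that noisy scans may need to raise.

```python
import subprocess

# Same options as the commands above; only the input file changes.
FILTER_ARGS = [
    "-alpha", "off", "+dither", "-colors", "2", "-colorspace", "gray",
    "-normalize", "-statistic", "median", "1x200", "-negate",
    "-format", "%[fx:mean*w*h]", "info:",
]

def build_command(png_path, executable="convert"):
    """Return the argument list for one page image."""
    return [executable, png_path, *FILTER_ARGS]

def vertical_pixel_count(png_path):
    """Run ImageMagick and return the number of 'tall feature' pixels."""
    result = subprocess.run(
        build_command(png_path), capture_output=True, text=True, check=True
    )
    return float(result.stdout.strip())

def pages_with_images(counts, threshold=0):
    """Given {page number: pixel count}, list pages above the threshold."""
    return sorted(page for page, count in counts.items() if count > threshold)
```

For the two pages measured above, `pages_with_images({27: 0, 28: 90224})` returns `[28]`. In practice the counts would come from calling `vertical_pixel_count` on each extracted page image.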

Here are some tips...

Hint

You can see how many pages there are in a PDF like this, though there are probably faster methods:

convert -density 18 book.pdf info:
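From Python, the page count could be read off that output as sketched below. This assumes `info:` prints one non-empty line per page and that `convert` is on the PATH; the function names are hypothetical.

```python
import subprocess

def count_info_lines(info_output):
    """Count non-empty lines, assuming one line of output per page."""
    return sum(1 for line in info_output.splitlines() if line.strip())

def count_pdf_pages(pdf_path):
    """Run the command above on `pdf_path` and count the reported pages."""
    result = subprocess.run(
        ["convert", "-density", "18", pdf_path, "info:"],
        capture_output=True, text=True, check=True,
    )
    return count_info_lines(result.stdout)
```

The low density only keeps rasterisation cheap; the count does not depend on it.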

Hint

You can extract a page of a PDF like this (ImageMagick numbers pages from 0):

convert -density 288 "book.pdf[25]" page-25.png
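To extract pages from a script, the same command can be built as an argument list, which also sidesteps shell quoting of the square brackets. A minimal sketch, assuming `convert` is on the PATH; `extract_command` and `extract_page` are hypothetical names, and the output file is named after the zero-based index, as in the command above.

```python
import subprocess

def extract_command(pdf_path, page_index, density=288):
    """Build the command that renders one page (zero-based index) to PNG."""
    return [
        "convert", "-density", str(density),
        f"{pdf_path}[{page_index}]",
        f"page-{page_index}.png",
    ]

def extract_page(pdf_path, page_index, density=288):
    """Render the page and return the name of the PNG that was written."""
    command = extract_command(pdf_path, page_index, density)
    subprocess.run(command, check=True)
    return command[-1]
```

Calling `extract_page("book.pdf", 25)` would run the same command as above and return `"page-25.png"`.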

Hint

If you are doing multiple books, you will probably want to normalise the images so that they are all, say, 1000 pixels tall; then the size of the structuring element (for calculating the median) should be fairly consistent.
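One way to sketch that normalisation is to resize every page to a fixed height before filtering and scale the structuring element by the same factor. The helper names, and the choice to place `-resize x1000` directly after the input file, are assumptions made for illustration; a suitable element height for a given book would still need checking by eye.

```python
def scaled_element_height(element_height, page_height, target_height=1000):
    """Scale the median window by the same factor as the page height."""
    return max(1, round(element_height * target_height / page_height))

def normalised_command(png_path, element_height, target_height=1000):
    """Build a convert command that resizes to `target_height` pixels tall
    (width follows the aspect ratio) before applying the median filter."""
    return [
        "convert", png_path,
        "-resize", f"x{target_height}",
        "-alpha", "off", "+dither", "-colors", "2", "-colorspace", "gray",
        "-normalize", "-statistic", "median", f"1x{element_height}",
        "-negate", "-format", "%[fx:mean*w*h]", "info:",
    ]
```

For example, a 200-pixel element tuned on pages 2000 pixels tall becomes `scaled_element_height(200, 2000)`, i.e. 100 pixels, once the pages are resized to 1000 pixels.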
