Using PIL to detect a scan of a blank page


Problem description


So I often run huge double-sided scan jobs on an unintelligent Canon multifunction, which leaves me with a huge folder of JPEGs. Am I insane to consider using PIL to analyze a folder of images to detect scans of blank pages and flag them for deletion?

Leaving the folder-crawling and flagging parts out, I imagine this would look something like:

  • Check if the image is greyscale, as this is presumed uncertain.
  • If so, detect the dominant range of shades (background colour).
  • If not, detect the dominant range of shades, restricting to light greys.
  • Determine what percentage of the entire image is composed of said shades.
  • Try to find a threshold that adequately detects pages with type or writing or imagery.
  • Perhaps test fragments of the image at a time to increase accuracy of threshold.

I know this is sort of an edge case, but can anyone with PIL experience lend some pointers?
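The steps listed above can be sketched with Pillow roughly as follows. This is only a minimal illustration, not a tested recipe: the helper names, the tolerance of 16 grey levels, and the 99% cut-off are all assumptions that would need tuning against real scans. Every page is converted to greyscale first, which sidesteps the "is it greyscale?" check, and `min_level` can be raised to restrict the search for the background shade to light greys.

```python
def dominant_shade(hist, min_level=0):
    """Return the grey level (0-255) with the most pixels, ignoring levels below min_level."""
    return max(range(min_level, len(hist)), key=lambda level: hist[level])


def background_fraction(hist, tolerance=16, min_level=0):
    """Fraction of all pixels lying within `tolerance` levels of the dominant shade."""
    total = sum(hist)
    if total == 0:
        return 0.0
    peak = dominant_shade(hist, min_level)
    lo = max(0, peak - tolerance)
    hi = min(len(hist) - 1, peak + tolerance)
    return sum(hist[lo:hi + 1]) / total


def is_blank(hist, tolerance=16, min_fraction=0.99, min_level=0):
    """Hypothetical rule: a page is blank if nearly every pixel is background-coloured."""
    return background_fraction(hist, tolerance, min_level) >= min_fraction


def page_histogram(path):
    """Load a scan and return its 256-bin greyscale histogram."""
    from PIL import Image  # imported lazily so the helpers above do not need Pillow
    with Image.open(path) as img:
        return img.convert("L").histogram()
```

A page would then be checked with `is_blank(page_histogram("scan_0001.jpg"))`, where the file name is just a placeholder.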

Solution

Here is an alternative solution, using mahotas (http://luispedro.org/software/mahotas) and milk.

  1. Start by creating two directories, positives/ and negatives/, into which you manually sort a few example scans.
  2. I will assume that the rest of the data is in an unlabeled/ directory.
  3. Compute features for all of the images in positives/ and negatives/.
  4. Learn a classifier.
  5. Use that classifier on the unlabeled images.

In the code below I used jug so that you can run it on multiple processors, but the code also works if you remove every line that mentions TaskGenerator:

from glob import glob
import mahotas
import mahotas.features
import milk
from jug import TaskGenerator


@TaskGenerator
def features_for(imname):
    # Haralick texture features, averaged over all directions
    img = mahotas.imread(imname)
    return mahotas.features.haralick(img).mean(0)

@TaskGenerator
def learn_model(features, labels):
    # Train milk's default classifier on the hand-labelled examples
    learner = milk.defaultclassifier()
    return learner.train(features, labels)

@TaskGenerator
def classify(model, features):
    return model.apply(features)

positives = glob('positives/*.jpg')
negatives = glob('negatives/*.jpg')
unlabeled = glob('unlabeled/*.jpg')


# A list comprehension instead of map(), so this is a real list on Python 3 too
features = [features_for(im) for im in negatives + positives]
labels = [0] * len(negatives) + [1] * len(positives)

model = learn_model(features, labels)

# One predicted label (0 or 1) per unlabeled image
labeled = [classify(model, features_for(u)) for u in unlabeled]

This uses texture features, which is probably good enough, but you can play with other features in mahotas.features if you'd like (or try mahotas.surf, but that gets more complicated). In general, I have found it hard to do classification with the sort of hard thresholds you are looking for unless the scanning is very controlled.
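If you still want to try the hard-threshold route, testing the page in fragments may make the threshold less fragile: a short handwritten note is a negligible share of a whole page but a noticeable share of one tile. The sketch below uses Pillow rather than mahotas, and the function names, the 256-pixel tile size, the grey cut-off of 128, and the 2% ink share are all illustrative assumptions, not values taken from any real scan set.

```python
def tile_boxes(width, height, tile=256):
    """Yield (left, upper, right, lower) crop boxes that cover the page in tiles."""
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield (left, top, min(left + tile, width), min(top + tile, height))


def dark_fraction(hist, cutoff=128):
    """Fraction of pixels darker than `cutoff` in a 256-bin grey histogram."""
    total = sum(hist)
    return sum(hist[:cutoff]) / total if total else 0.0


def any_tile_has_ink(tile_hists, cutoff=128, min_dark=0.02):
    """True if at least one tile has a dark-pixel share of min_dark or more."""
    return any(dark_fraction(h, cutoff) >= min_dark for h in tile_hists)


def page_has_ink(path, tile=256, cutoff=128, min_dark=0.02):
    """Load a scan, split it into tiles, and test each tile's histogram."""
    from PIL import Image  # imported lazily so the helpers above do not need Pillow
    with Image.open(path) as img:
        grey = img.convert("L")
        hists = [grey.crop(box).histogram()
                 for box in tile_boxes(grey.size[0], grey.size[1], tile)]
    return any_tile_has_ink(hists, cutoff, min_dark)
```

Scanner shadows along the page edges would likely trip a rule like this, so cropping a margin before tiling is probably worth trying.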
