How to detect text regions in a document image?


Question

I have a document image, which might be a newspaper or a magazine, for example a scanned newspaper. I want to remove all or most of the text and keep the images in the document. Does anyone know how to detect the text regions in the document? Below is an example. Thanks in advance!

Example image: https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg

Recommended answer

The usual object-recognition pattern works here: threshold, detect regions, filter regions, then do what you need with the remaining regions.

Thresholding is easy here. The background is pure white (or can be filtered to be pure white), so anything above 0 in the inverted grayscale image is either text or a picture. Regions can then be detected within this thresholded binary image.
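As a minimal sketch of this step, assuming the Image Processing Toolbox is available, the snippet below thresholds the inverted grayscale version of a small synthetic page and counts the connected regions. The synthetic page is a stand-in for the newspaper scan, and all variable names are illustrative.

```matlab
% Synthetic white page with three small "letters" and one large "picture".
page = 255 * ones(120, 160, 'uint8');
page(10:14, 10:13) = 0;      % letter-sized blob, 5x4 = 20 px
page(10:14, 20:23) = 0;      % letter-sized blob
page(10:14, 30:33) = 0;      % letter-sized blob
page(40:99, 50:129) = 60;    % picture-sized blob, 60x80 = 4800 px

inverted = 255 - page;                          % background -> 0, content -> bright
mask = inverted > graythresh(inverted) * 255;   % Otsu level scaled to the uint8 range

cc = bwconncomp(mask);                          % connected components (8-connectivity by default)
fprintf('Found %d regions covering %d pixels\n', cc.NumObjects, nnz(mask));
```

Each letter comes out as its own small region and the picture as a single large one, which is what the filtering step relies on.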

To filter the regions, we just have to identify what makes text different from pictures. Text regions are small, since every letter is its own region. Pictures are big regions in comparison. Filtering by region area with a suitable threshold keeps all of the pictures and removes all of the text, assuming no picture on the page is about the size of a single letter. If some are, other filtering criteria can be used (saturation, hue variance, ...).
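If area alone cannot separate the two, the saturation criterion could be sketched roughly as follows. This is an illustration on a synthetic color page, not part of the original implementation. `MinSat` and its value are hypothetical and would need tuning, and the approach assumes that pictures are noticeably more saturated than black text, which does not hold for grayscale pictures.

```matlab
% Synthetic RGB page: white background, a black "text" block and a red
% "picture" block of the same size, so area cannot tell them apart.
img = 255 * ones(60, 100, 3, 'uint8');
img(10:29, 10:39, :) = 0;        % black block: saturation 0
img(10:29, 60:89, 1) = 200;      % red block: high saturation
img(10:29, 60:89, 2:3) = 30;

MinSat = 0.3;                    % hypothetical saturation cutoff in [0, 1]

gsImg = 255 - rgb2gray(img);                 % inverted grayscale
mask = gsImg > graythresh(gsImg) * 255;      % Otsu threshold

hsvImg = rgb2hsv(img);                       % channel 2 is saturation, double in [0, 1]
regs = regionprops(mask, hsvImg(:, :, 2), 'BoundingBox', 'Area', 'MeanIntensity');

% MeanIntensity is the mean saturation of each region here, because the
% saturation channel was passed in as the intensity image.
keep = [regs.MeanIntensity] > MinSat;
fprintf('%d of %d regions kept\n', nnz(keep), numel(regs));
```

Passing the saturation channel as the intensity image lets `regionprops` compute the per-region mean directly, so no explicit loop over pixels is needed.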

Once the regions are filtered (by area here, and by saturation if needed), a new image can be created by copying the pixels of the original image that fall within the bounding boxes of the remaining regions into a blank image.

MATLAB implementation:

%%%%%%%%%%%%
% Set these values depending on your input image

img = imread('https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg');

MinArea = 2000; % Minimum area to consider, in pixels
%%%%%%%%%
% End User inputs

gsImg = 255 - rgb2gray(img); % convert to grayscale (and invert 'cause that's how I think)
threshImg = gsImg > graythresh(gsImg)*max(gsImg(:)); % Threshold automatically

% Detect connected regions and measure their bounding boxes and areas
regs = regionprops(threshImg, 'BoundingBox', 'Area');

% Keep only the regions whose area exceeds MinArea
regKeep = [regs.Area] > MinArea;

regs(~regKeep) = []; % Delete those regions that don't pass qualifications for image

% Make a new blank image to hold the passed regions
newImg = 255*ones(size(img), 'uint8');

for k = 1:length(regs)

    boxHere = regs(k).BoundingBox; % [x y width height]; x and y lie half a pixel before the first pixel
    rows = ceil(boxHere(2)):(ceil(boxHere(2)) + boxHere(4) - 1); % Pixel rows covered by the box
    cols = ceil(boxHere(1)):(ceil(boxHere(1)) + boxHere(3) - 1); % Pixel columns covered by the box
    % Copy the pixels inside the bounding box from the original image into
    % the new image (ceil avoids a zero index for regions touching the top or left edge)
    newImg(rows, cols, :) = img(rows, cols, :);

end

% Display
figure()
image(newImg);

As you can see in the image linked below, it does what is needed: everything but the pictures and the masthead is removed. The good thing is that this works with both color and grayscale images, which helps if you're working with newspaper pages away from the front page.
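`MinArea` in the code above is set by hand. One way to pick a starting value, offered as a heuristic sketch rather than part of the original answer, is to sort the region areas and look for the largest relative jump between consecutive values. The example uses a synthetic page. It assumes at least two regions and a clear size gap between letters and pictures, which a real scan may not have (a large masthead, for instance, lands on the picture side).

```matlab
% Synthetic page: many letter-sized blobs and one picture-sized blob.
page = 255 * ones(200, 200, 'uint8');
for r = 10:10:60
    for c = 10:10:190
        page(r:r+4, c:c+3) = 0;          % 5x4 "letters", 20 px each
    end
end
page(100:179, 40:159) = 80;              % 80x120 "picture", 9600 px

inverted = 255 - page;
mask = inverted > graythresh(inverted) * 255;
regs = regionprops(mask, 'Area');

areas = sort([regs.Area], 'descend');
% The largest ratio between consecutive sorted areas marks the gap
% between picture-sized and letter-sized regions.
[~, gapIdx] = max(areas(1:end-1) ./ areas(2:end));
MinArea = sqrt(areas(gapIdx) * areas(gapIdx + 1));   % geometric midpoint of the gap
fprintf('Suggested MinArea: %.0f px\n', MinArea);
```

The suggested value is only a starting point. Checking the sorted areas by eye before trusting it is still advisable.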

Result:

http://imgur.com/vEmpavY,dd172fr#1
