Easy ways to detect and crop blocks (paragraphs) of text out of an image?

Question

I have done some research on the subject, but I think my question is significantly different from what has been asked before.

My PhD thesis deals with OCR-ing an old dictionary and converting the result into an XML-like database automatically. This part I have figured out. However, I'd like to enrich the final result by displaying a fragment of scan used for each entry/headword. As the dictionary is almost 9000 pages long, doing it manually is out of the question.

This is how a random page looks: http://i.imgur.com/X2mPZr0.png

As each entry always equals one paragraph, I would like to find a way to split every image into rectangles with text (no OCR needed) as separate files, like this (without drawing the rectangles): http://i.imgur.com/CWtQD6Q.png

The good thing is that the scans I have are identical in shape and size, and similar in terms of margins/text alignment. Every paragraph always has an indentation, too.

The bad thing is that I am mostly a linguist and not much of a programmer. Most of my experience is with Ruby, XML and CSS. Also, some paragraphs are only one line long.

I am aware of some ways to do a similar thing:

  • Algorithm to detect presence of text on image
  • http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html
  • http://answers.opencv.org/question/27411/use-opencv-to-detect-text-blocks-send-to-tesseract-ios/
  • https://github.com/kanaadp/iReader

but they would require a significant amount of time for me to learn (especially as I have zero knowledge of Python), and I don't know whether they allow not only text detection but also paragraph detection.

Any input/suggestion on the matter would be greatly appreciated, especially if it is newbie-friendly.

Recommended answer

I have a few ideas to share... I think I would proceed along these lines:

LOW-RESOLUTION COPY OF ORIGINAL IMAGE JUST FOR REFERENCE

Step 1 - Threshold to black and white

I think I would use OpenCV's Otsu thresholding for this.
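
To make clearer what Otsu thresholding does, here is a minimal sketch in plain C++ with no OpenCV dependency. It picks the grey level that best separates dark ink from light paper by maximising the between-class variance. The function name `otsuThreshold` and the flat pixel vector are assumptions made for illustration; in practice OpenCV's own `threshold` call does this for you.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Return a threshold t such that splitting the pixels into "<= t" and "> t"
// maximises the between-class variance (the criterion used by Otsu's method).
int otsuThreshold(const std::vector<std::uint8_t>& pixels)
{
    assert(!pixels.empty());

    // Histogram of the 256 possible grey levels
    std::array<long long, 256> hist{};
    for (std::uint8_t p : pixels) ++hist[p];

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * static_cast<double>(hist[i]);

    double sumBelow = 0.0, countBelow = 0.0, bestVariance = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        countBelow += static_cast<double>(hist[t]);
        sumBelow += t * static_cast<double>(hist[t]);
        const double countAbove = total - countBelow;
        if (countBelow == 0.0 || countAbove == 0.0) continue;  // one class is empty

        const double meanBelow = sumBelow / countBelow;
        const double meanAbove = (sumAll - sumBelow) / countAbove;
        const double diff = meanBelow - meanAbove;
        const double variance = countBelow * countAbove * diff * diff;
        if (variance > bestVariance) {
            bestVariance = variance;
            bestT = t;
        }
    }
    return bestT;
}
```

Pixels brighter than the returned level would then be set to white and the rest to black.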

Step 2 - Find the vertical black line

I would average the pixels in every column of the image and find the column with the lowest average; that should be the vertical line up the middle. The code below outputs:

Centreline at column: 1635
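
The column search can be illustrated without OpenCV. The sketch below is an assumption-laden toy, with a hypothetical `GreyImage` type (a vector of rows) and a hypothetical function name `darkestColumn`. It compares column sums rather than averages, which gives the same ranking because every column has the same height.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rows of pixels; 0 = black, 255 = white. Assumed to be rectangular.
using GreyImage = std::vector<std::vector<std::uint8_t>>;

// Return the index of the column with the lowest total brightness.
// On ties, the leftmost such column is returned.
std::size_t darkestColumn(const GreyImage& image)
{
    assert(!image.empty() && !image[0].empty());
    const std::size_t cols = image[0].size();

    std::vector<unsigned long long> columnSum(cols, 0);
    for (const auto& row : image) {
        assert(row.size() == cols);
        for (std::size_t x = 0; x < cols; ++x) columnSum[x] += row[x];
    }

    std::size_t best = 0;
    for (std::size_t x = 1; x < cols; ++x)
        if (columnSum[x] < columnSum[best]) best = x;
    return best;
}
```

This only works when the printed rule between the two text columns is darker, on average, than any column of text, which appears to hold for the sample page.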

Step 3 - Split the image in two and trim the surplus whitespace

Step 4 - Box filter

I would box filter with a 55x45 box that matches the indent at the start of each paragraph, then threshold, so that all paragraph starts are marked with black boxes.
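
The idea behind this step is that a box-filtered pixel stays pure white only if every pixel under the box is white, so thresholding just below 255 picks out indent-sized blank areas. The sketch below shows the same test in plain C++ using a summed-area table. It is a simplification under stated assumptions: the names `BinaryImage` and `whiteBoxes` are hypothetical, and the window is anchored at its top-left corner and skipped where it does not fit, whereas OpenCV's `boxFilter` centres the window by default and extrapolates at the borders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Thresholded page: 0 = ink, 255 = paper. Assumed to be rectangular.
using BinaryImage = std::vector<std::vector<std::uint8_t>>;

// result[y][x] is true when the boxW x boxH window with its top-left corner
// at (y, x) fits inside the image and contains only white pixels.
std::vector<std::vector<bool>> whiteBoxes(const BinaryImage& image,
                                          std::size_t boxW, std::size_t boxH)
{
    assert(!image.empty() && !image[0].empty() && boxW > 0 && boxH > 0);
    const std::size_t rows = image.size(), cols = image[0].size();

    // ink[y][x] = number of non-white pixels in the rectangle [0, y) x [0, x)
    std::vector<std::vector<long>> ink(rows + 1, std::vector<long>(cols + 1, 0));
    for (std::size_t y = 0; y < rows; ++y)
        for (std::size_t x = 0; x < cols; ++x)
            ink[y + 1][x + 1] = (image[y][x] != 255 ? 1 : 0)
                              + ink[y][x + 1] + ink[y + 1][x] - ink[y][x];

    std::vector<std::vector<bool>> result(rows, std::vector<bool>(cols, false));
    for (std::size_t y = 0; y + boxH <= rows; ++y)
        for (std::size_t x = 0; x + boxW <= cols; ++x) {
            const long count = ink[y + boxH][x + boxW] - ink[y][x + boxW]
                             - ink[y + boxH][x] + ink[y][x];
            result[y][x] = (count == 0);
        }
    return result;
}
```

Calling `whiteBoxes(page, 55, 45)` on a trimmed column and looking only at `x == 0` would report the rows where an indent-sized blank sits at the left edge.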

I am pretty new to OpenCV but have coded the above ideas as follows - I'm sure lots of it could be made more robust and more efficient, so treat it as conceptual ;-)

#include <iostream>
#include <opencv2/opencv.hpp>

using namespace cv;
using namespace std;

int
main(int argc,char*argv[])
{
   // Load image and stop early if it could not be read
   Mat orig=imread("page.png",IMREAD_COLOR);
   if(orig.empty()){
      cerr << "Could not read page.png" << endl;
      return 1;
   }

   // The constants below use the C++ enum names; the older CV_-prefixed
   // macros are no longer exposed by default in OpenCV 4
   vector<int> PNGwriteOptions;
   PNGwriteOptions.push_back(IMWRITE_PNG_COMPRESSION);
   PNGwriteOptions.push_back(9);

   // Get greyscale and Otsu-thresholded version (imread returns BGR channel order)
   Mat bw,grey;
   cvtColor(orig,grey,COLOR_BGR2GRAY);
   threshold(grey,bw,0,255,THRESH_BINARY|THRESH_OTSU);

   // Find vertical centreline by looking for lowest column average - i.e. darkest vertical bar
   Mat colsums;
   reduce(bw,colsums,0,REDUCE_AVG);
   double min,max;
   Point min_loc, max_loc;
   minMaxLoc(colsums,&min,&max,&min_loc,&max_loc);
   cout << "Centreline at column: " << min_loc.x << endl;

   namedWindow("test",WINDOW_AUTOSIZE);

   // Split image into left and right
   Rect leftROI(0,0,min_loc.x,bw.rows);
   Mat  leftbw=bw(leftROI);
   Rect rightROI(min_loc.x+8,0,bw.cols-min_loc.x-8,bw.rows);
   Mat  rightbw=bw(rightROI);
   imshow("test",leftbw);
   waitKey(0); 
   imshow("test",rightbw);
   waitKey(0); 

   // Trim surrounding whitespace off
   Mat Points;
   Mat inverted =  cv::Scalar::all(255) - leftbw;
   findNonZero(inverted,Points);
   Rect bRect=boundingRect(Points);
   Mat lefttrimmed=leftbw(bRect);

   inverted =  cv::Scalar::all(255) - rightbw;
   findNonZero(inverted,Points);
   bRect=boundingRect(Points);
   Mat righttrimmed=rightbw(bRect);

   imwrite("lefttrimmed.png",lefttrimmed,PNGwriteOptions);
   imwrite("righttrimmed.png",righttrimmed,PNGwriteOptions);

   // Box filter with 55x45 rectangle to match size of paragraph indent on left
   Mat lBoxFilt,rBoxFilt;
   boxFilter(lefttrimmed,lBoxFilt,-1,Size(55,45));
   normalize(lBoxFilt,lBoxFilt,0,255,NORM_MINMAX,CV_8UC1);
   threshold(lBoxFilt,lBoxFilt,254,255,THRESH_BINARY_INV);
   imwrite("leftBoxed.png",lBoxFilt,PNGwriteOptions);

}

Just in case you need a hand to build this code - as it seems non-trivial to compile and link anything against OpenCV - I made my CMakeLists.txt file like this and stored it in the same directory as the source file. Then I create a sub-directory called build to do an "out-of-source" build in, and the build process is:

cd build
cmake ..
make -j 8
./demo

CMakeLists.txt

cmake_minimum_required(VERSION 2.8)
project(demo)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")
find_package(OpenCV REQUIRED)
add_executable(demo main.cpp)
target_link_libraries(demo ${OpenCV_LIBS})
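
The program above stops once the paragraph starts are marked. One possible next step, which is not part of the original answer, is to turn those marks into row ranges so that each paragraph can be cropped to its own file. The sketch below is plain C++ with hypothetical names (`RowRange`, `paragraphRanges`); it assumes you have already reduced the box-filtered column to one flag per row saying whether an indent-sized white box was found at the left edge.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Half-open row range [first, second) covering one paragraph.
using RowRange = std::pair<std::size_t, std::size_t>;

// indentAtRow[y] is true when an indent-sized white box was found at the left
// edge of row y. Each run of consecutive true rows counts as one paragraph start.
std::vector<RowRange> paragraphRanges(const std::vector<bool>& indentAtRow)
{
    std::vector<RowRange> ranges;
    const std::size_t rows = indentAtRow.size();
    if (rows == 0) return ranges;

    // Collect the first row of every run of flagged rows
    std::vector<std::size_t> starts;
    for (std::size_t y = 0; y < rows; ++y) {
        const bool runBegins = indentAtRow[y] && (y == 0 || !indentAtRow[y - 1]);
        if (runBegins) starts.push_back(y);
    }

    // Text above the first indent continues a paragraph from the previous column or page
    if (starts.empty() || starts.front() > 0)
        ranges.emplace_back(std::size_t{0}, starts.empty() ? rows : starts.front());

    // Each paragraph runs from its own start to the next start (or the bottom)
    for (std::size_t i = 0; i < starts.size(); ++i) {
        const std::size_t end = (i + 1 < starts.size()) ? starts[i + 1] : rows;
        ranges.emplace_back(starts[i], end);
    }
    return ranges;
}
```

Each range could then be cut out of the trimmed column in the same way the program above takes a region of interest with a `Rect`, and written out with `imwrite`. Blank areas that are not indents, such as the gap after a short last line, would need extra filtering, so treat this as a starting point.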

Keywords: Image processing, book, margin, spine, centreline, page, crease, fold, gutter, binding, stitching, text, paragraph, detect, detection.
