Article extraction from a newspaper image in Python and OpenCV

Problem description

I tried extracting articles from a newspaper image, but when the RLSA algorithm is applied horizontally and vertically with a certain pixel value, the headings are separated from their articles, as shown in the first image. If I use a larger pixel value, the articles merge, as shown in the second image. Can anyone suggest the best method to separate articles from the image in Python and OpenCV?

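    # Horizontal RLSA: in each row, whenever two consecutive black (0)
    # pixels are at most 10 pixels apart, blacken the pixels between them.
    for i in range(1, a):
        c = 1  # position of the previous black pixel (starts at index 1)
        for j in range(1, b):
            if im_bw[i, j] == 0:
                if (j - c) <= 10:
                    im_bw[i, c:j] = 0
                c = j
        # Also fill a short gap between the last black pixel and the right edge.
        if (b - c) <= 10:
            im_bw[i, c:b] = 0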

The loop above applies the run-length smoothing algorithm (RLSA) horizontally; the loop below applies it vertically.

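    # Vertical RLSA: in each column, whenever two consecutive black (0)
    # pixels are at most 9 pixels apart, blacken the pixels between them.
    for i in range(1, b):
        c = 1  # position of the previous black pixel (starts at index 1)
        for j in range(1, a):
            if im_bw[j, i] == 0:
                if (j - c) <= 9:
                    im_bw[c:j, i] = 0
                c = j
        # A column has a rows, so the trailing gap is measured against a, not b.
        if (a - c) <= 9:
            im_bw[c:a, i] = 0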

Here a is the number of rows and b is the number of columns of the binary image.
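For experimenting with different pixel values, the two loops can be wrapped in one function. The following is a minimal sketch, not the asker's exact code: the function name `rlsa` and its parameters `value` and `horizontal` are made up for illustration, it assumes a NumPy array where 0 is ink and 255 is background, and unlike the loops above it does not fill the gap between the last black pixel and the image border.

```python
import numpy as np

def rlsa(binary, value, horizontal=True):
    """Return a smoothed copy of a binary image (0 = ink, 255 = background).

    Whenever two consecutive ink pixels in a row (or column, if
    horizontal is False) are at most `value` pixels apart, the
    background pixels between them are set to ink.
    """
    # Work on rows; for the vertical pass, transpose so columns become rows.
    img = binary if horizontal else binary.T
    out = img.copy()
    for row_in, row_out in zip(img, out):
        ink = np.flatnonzero(row_in == 0)  # positions of ink pixels in this row
        for start, end in zip(ink[:-1], ink[1:]):
            if end - start <= value:
                row_out[start:end] = 0  # row_out is a view, so this edits out
    return out if horizontal else out.T

# Tiny demonstration: one row with ink at columns 2, 5 and 11.
demo = np.full((3, 12), 255, dtype=np.uint8)
demo[1, [2, 5, 11]] = 0
small = rlsa(demo, 3)   # bridges only the 2..5 gap
large = rlsa(demo, 6)   # also bridges the 5..11 gap
```

On this toy row, the smaller value leaves the pixel at column 11 isolated while the larger value joins everything into one run, which is the same trade-off between split headings and merged articles described in the question.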

How the algorithm worked on the binary image; the red mark shows where articles merged.

Recommended answer

I have an approach that works for most images.

  1. Convert the color or grayscale image to binary using PIL or OpenCV.
  2. Remove pictures from the image: treat contours whose area is large compared with the average area of all contours in the image as pictures.
  3. Remove lines using the Canny edge filter and Hough lines.
  4. Apply RLSA (run-length smoothing algorithm) to this binary image. A description and code for RLSA can be found in this repository: https://github.com/Vasistareddy/python-rlsa

Removing lines helps because some e-papers use lines as article separators. More processing of the image can give better results: after the steps above, heuristics such as average width, average height, and average area can be applied to the contours that remain.
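One possible form of such a heuristic is sketched below. The answer only names the statistics, so the specific rule here is an assumption: the hypothetical `filter_blocks` helper and its `min_ratio` parameter drop any block whose width, height, or area falls well below the average. Boxes are plain `(x, y, w, h)` tuples, the shape `cv2.boundingRect` returns for a contour.

```python
from statistics import mean

def filter_blocks(boxes, min_ratio=0.25):
    """Keep boxes whose width, height and area are each at least
    min_ratio times the average over all boxes.

    boxes: list of (x, y, w, h) tuples.
    min_ratio: assumed cutoff, to be tuned per newspaper.
    """
    if not boxes:
        return []
    avg_w = mean(w for _, _, w, _ in boxes)
    avg_h = mean(h for _, _, _, h in boxes)
    avg_area = mean(w * h for _, _, w, h in boxes)
    return [
        (x, y, w, h)
        for x, y, w, h in boxes
        if w >= min_ratio * avg_w
        and h >= min_ratio * avg_h
        and w * h >= min_ratio * avg_area
    ]

# Three article-sized blocks and one speck left over after smoothing.
blocks = [(0, 0, 200, 300), (210, 0, 180, 280), (0, 310, 220, 260), (400, 5, 6, 4)]
articles = filter_blocks(blocks)
```

Comparing against the average keeps the rule independent of scan resolution, though a few very large blocks can pull the averages up and cause small genuine articles to be dropped.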

As for the question above: articles always have a white background. Regions without a white background are clearly ads, pictures, or miscellaneous content. Removing pictures as described in the four steps above solves this issue.

PS: Choosing values for horizontal and vertical RLSA is always a mystery, since the gap between articles varies from edition to edition.
