Removing horizontal underlines


Problem description

I am attempting to pull text from a few hundred JPGs that contain information on capital punishment records; the JPGs are hosted by the Texas Department of Criminal Justice (TDCJ). Below is an example snippet with personally identifiable information removed.

I've identified the underlines as the impediment to proper OCR: if I screenshot a sub-snippet and manually white out the lines, the resulting OCR through pytesseract is very good. With the underlines present, it's extremely poor.

How can I best remove these horizontal lines? What I have tried:

  • Started on OpenCV doc's walkthrough: Extract horizontal and vertical lines by using morphological operations. Got stuck pretty quickly, because I know zero C++.
  • Followed along with Removing Horizontal Lines in image - ended up with an illegible string.
  • Followed along with Removing long horizontal/vertical lines from edge image using OpenCV - wasn't able to get the intuition behind sizing the array of zeros here.

I'm tagging this question with c++ in the hope that someone can help translate Step 5 of the docs walkthrough to Python. I've tried a batch of transformations such as the Hough Line Transform, but I am feeling around in the dark within a library and an area where I have zero prior experience.

import cv2

# Inverted grayscale
img = cv2.imread('rsnippet.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.bitwise_not(img)

# Transform inverted grayscale to binary
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 15, -2)

# An alternative; Not sure if `th` or `th2` is optimal here
th2 = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)[1]

# Create corresponding structure element for horizontal lines.
# Start by cloning th/th2.
horiz = th.copy()
r, c = horiz.shape

# Lost after here - not understanding intuition behind sizing/partitioning

Recommended answer

All the answers so far seem to be using morphological operations. Here's something a bit different. This should give fairly good results if the lines are horizontal.

For this I use a part of your sample image, shown below.

Load the image, convert it to grayscale, and invert it.

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

Inverted grayscale image:

If you scan a row in this inverted image, you'll see that its profile looks different depending on whether a line is present.

# Row 18 has no underline in this sample; row 36 does.
plt.figure(1)
plt.plot(gray[18, :] > 16, 'g-')
plt.axis([0, gray.shape[1], 0, 1.1])
plt.figure(2)
plt.plot(gray[36, :] > 16, 'r-')
plt.axis([0, gray.shape[1], 0, 1.1])
plt.show()

The profile in green is a row with no underline; the red one is a row with an underline. If you take the average of each profile, you'll see that the red one has a higher average.

So, using this approach, you can detect the underlines and remove them.

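for row in range(gray.shape[0]):
    # fraction of pixels in this row that are brighter than 16
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        # mark the detected row in red on the color image
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        # erase the row in the inverted grayscale image
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cv2.imshow("gray", 255 - gray)
cv2.imshow("im", im)
cv2.waitKey(0)  # keep the windows open until a key is pressed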

Here are the detected underlines in red, and the cleaned image.

tesseract output of the cleaned image:

Convthed as th(
shot once in the
she stepped fr<
brother-in-lawii
collect on life in
applied for man
to the scheme i|

The reason for using only part of the image should be clear by now. Since personally identifiable information has been removed from the original image, the threshold wouldn't have worked on it. This should not be a problem when you apply the method to the real images. You may sometimes have to adjust the thresholds (16, 0.9).
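When tuning the two thresholds, it can help to compute the ink fraction of every row at once and inspect it. The following is a minimal sketch, assuming an inverted grayscale array (ink bright, paper dark); the helper `underline_rows` and its parameter names are hypothetical and not part of the code in this answer.

```python
import numpy as np

def underline_rows(gray_inv, pixel_thresh=16, row_frac=0.9):
    """Return (row indices flagged as underlines, per-row ink fraction).

    gray_inv is assumed to be an inverted grayscale image.
    pixel_thresh and row_frac play the roles of 16 and 0.9 in the answer.
    """
    # Fraction of pixels in each row that are brighter than pixel_thresh.
    fill = (gray_inv > pixel_thresh).mean(axis=1)
    # Rows that are almost entirely ink are treated as underline rows.
    return np.flatnonzero(fill > row_frac), fill

# Synthetic 6x10 image: row 2 is text-like (40% ink), row 4 is a full-width line.
demo = np.zeros((6, 10), dtype=np.uint8)
demo[2, :4] = 200
demo[4, :] = 200

rows, fill = underline_rows(demo)
print(rows, fill)
```

Plotting `fill` for a real image should show whether 0.9 separates underline rows from text rows, or whether the cutoff needs to move.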

The result does not look very good: parts of the letters are removed and some of the faint lines still remain. I will update if I can improve it a bit more.

Update:

I made some improvements: cleanup, and linking the missing parts of the letters. I've commented the code, so I believe the process is clear. You can also check the resulting intermediate images to see how it works. The results are a bit better.

tesseract output of the cleaned image:

Convicted as th(
shot once in the
she stepped fr<
brother-in-law. ‘
collect on life ix
applied for man
to the scheme i|

tesseract output of the cleaned image:

)r-hire of 29-year-old .
revolver in the garage ‘
red that the victim‘s h
{2000 to kill her. mum
250.000. Before the kil
If$| 50.000 each on bin
to police.

Python code:

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample2.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
# prepare a mask using Otsu threshold, then copy from original. this removes some noise
# (flags are combined with bitwise |, not the boolean `or`)
__, bw = cv2.threshold(cv2.dilate(gray, None), 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
gray = cv2.bitwise_and(gray, bw)
# make copy of the low-noise underlined image
grayu = gray.copy()
imcpy = im.copy()
# scan each row and remove lines
for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cont = gray.copy()
graycpy = gray.copy()
# after contour processing, the residual will contain small contours
residual = gray.copy()
# find contours ([-2:] also works on OpenCV 3.x, where findContours returns three values)
contours, hierarchy = cv2.findContours(cont, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for i in range(len(contours)):
    # find the boundingbox of the contour
    x, y, w, h = cv2.boundingRect(contours[i])
    if 10 < h:
        cv2.drawContours(im, contours, i, (0, 255, 0), -1)
        # if boundingbox height is higher than threshold, remove the contour from residual image
        cv2.drawContours(residual, contours, i, (0, 0, 0), -1)
    else:
        cv2.drawContours(im, contours, i, (255, 0, 0), -1)
        # if boundingbox height is less than or equal to threshold, remove the contour from the gray image
        cv2.drawContours(gray, contours, i, (0, 0, 0), -1)

# now the residual only contains small contours. open it to remove thin lines
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
residual = cv2.morphologyEx(residual, cv2.MORPH_OPEN, st, iterations=1)
# prepare a mask for residual components
__, residual = cv2.threshold(residual, 0, 255, cv2.THRESH_BINARY)

cv2.imshow("gray", gray)
cv2.imshow("residual", residual)   

# combine the residuals. we still need to link the residuals
combined = cv2.bitwise_or(cv2.bitwise_and(graycpy, residual), gray)
# link the residuals
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (1, 7))
linked = cv2.morphologyEx(combined, cv2.MORPH_CLOSE, st, iterations=1)
cv2.imshow("linked", linked)
# prepare a mask from linked image
__, mask = cv2.threshold(linked, 0, 255, cv2.THRESH_BINARY)
# copy region from low-noise underlined image
clean = 255 - cv2.bitwise_and(grayu, mask)
cv2.imshow("clean", clean)
cv2.imshow("im", im)
cv2.waitKey(0)  # keep the windows open until a key is pressed
