Select part of text that was extracted using the Tesseract OCR

Problem description

I'm using the latest Tesseract OCR engine in R to extract text from a couple of images. It works pretty well and I'm happy with the results. The problem is that I don't want the whole text, just one part of it, and I don't know how to extract that part.

The code looks like this:

library("tesseract") 
library("pdftools")
library("magick")

mypdfFile<-"C:/Users/.../fileName.pdf"

mypngFile<-pdf_convert(mypdfFile, format="png", pages=1, dpi=600)

myImage<-image_read("fileName_1.png")

textFile<-ocr(myImage,engine = tesseract("spa"), HOCR = FALSE) # Text is in spanish

cat(textFile) 

Now, the end result looks like this:

bla bla bla bla bla bla 
bla text that I want to 
extract bla bla bla bla 
bla bla bla bla bla bla  

How can I get the text that I want to extract, and only that?

I tried to crop the image before applying the ocr() function, but cropping just that part is not feasible, or at least not very accurate. ocr() returns plain text.

A complete example follows

The image (originally a pdf file) is an electricity bill. I can't provide it in full due to privacy issues, but it looks like this sample image. Under NOMBRE Y DIRECCION (name and address), there should be two lines (one with the name and the other with the address) followed by "GALEANA CENTRO LERDO. C.P." (the name of the city) and "35150 LERDO, DGO." (zip code and state). My code looks like this:

myImage<-image_read("sampleImage.png")

myImage<-image_crop(myImage, "WIDTHxHEIGHT+XOFF+YOFF") # placeholder geometry (actual dimensions not given): crop the right half and some from the top

textFile<-ocr(myImage,engine = tesseract("spa"), HOCR = FALSE) 

cat(textFile) 

I get:

Nombre y Domicilio
NAME REDACTED 
ADDRESS REDACTED
GALEANA CENTRO LERDO. C.P.
35150 LERDO, DGO.
Cuenta E Tarifa
30DC27B011164660 General < 25kW 02
AE A MA E
Num. de Lectura Lectura Mult. C
Medidor actual anterior
BD6687 40994 40539 1 ¿
Apoyo gubernamental

From this, I just want to extract everything between "NAME REDACTED" and "35150 LERDO, DGO.", inclusive.

Recommended answer

You could either crop the image first, if you know where your text is, or restrict what tesseract is looking for, for example with a whitelist, see here.
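As a rough sketch of those two options, the helper below crops an image to a region and optionally passes a character whitelist to the engine. The name ocr_region and its arguments are hypothetical; the geometry string follows the ImageMagick "WIDTHxHEIGHT+X+Y" convention used by image_crop(), and tessedit_char_whitelist is a Tesseract option whose effect may depend on the installed Tesseract version. It assumes the magick and tesseract packages and the Spanish language data are installed.

```r
# Sketch only: nothing is read or OCR'd until ocr_region() is called.
# Requires the magick and tesseract packages plus the "spa" language data.
ocr_region <- function(path, geometry, whitelist = NULL) {
  img <- magick::image_read(path)
  # geometry uses ImageMagick syntax: "WIDTHxHEIGHT+X_OFFSET+Y_OFFSET"
  img <- magick::image_crop(img, geometry)
  # Only set the whitelist option when one was supplied
  opts <- if (is.null(whitelist)) NULL else list(tessedit_char_whitelist = whitelist)
  engine <- tesseract::tesseract(language = "spa", options = opts)
  tesseract::ocr(img, engine = engine)
}

# Hypothetical call; the file name is from the question, the geometry is made up:
# ocr_region("sampleImage.png", "1200x400+0+300",
#            whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,")
```

Cropping only helps when the block sits at a stable position on every bill, which the question suggests is not reliably the case.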

Following the comments, the address can indeed be retrieved, here using the rule "take the two lines after the line where 'Address' is mentioned":

text <- ("Nombre y Domicilio
NAME REDACTED 
ADDRESS REDACTED
GALEANA CENTRO LERDO. C.P.
35150 LERDO, DGO.
Cuenta E Tarifa
30DC27B011164660 General < 25kW 02
AE A MA E
Num. de Lectura Lectura Mult. C
Medidor actual anterior
BD6687 40994 40539 1 ¿
Apoyo gubernamental")

library(dplyr)  # provides the %>% pipe
text2 <- strsplit(text, "\n") %>% unlist()  # one element per OCR line
addressline <- which(grepl("address", text2, ignore.case = TRUE))  # index of the line mentioning "address"
addresslines <- addressline + 1:2  # the two lines that follow it
address_extracted <- text2[addresslines]
address_extracted
# [1] "GALEANA CENTRO LERDO. C.P." "35150 LERDO, DGO."
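
The question asks for everything from the name line through the zip-code line, inclusive. One way to get that from the OCR text is to anchor on the header line and on the first following line that starts with a five-digit code. This is a minimal base-R sketch, not part of the original answer: extract_block is a hypothetical helper, and both patterns are assumptions about how these bills are laid out.

```r
# OCR output, shortened from the example in the question
ocr_text <- "Nombre y Domicilio
NAME REDACTED
ADDRESS REDACTED
GALEANA CENTRO LERDO. C.P.
35150 LERDO, DGO.
Cuenta E Tarifa
30DC27B011164660 General < 25kW 02"

# Hypothetical helper: return the lines after the first line matching
# `start_pattern`, up to and including the first later line matching
# `end_pattern`. Returns character(0) if either marker is missing.
extract_block <- function(text, start_pattern, end_pattern) {
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  start <- grep(start_pattern, lines, ignore.case = TRUE)
  if (length(start) == 0) return(character(0))
  first <- start[1] + 1
  if (first > length(lines)) return(character(0))
  end_rel <- grep(end_pattern, lines[first:length(lines)])
  if (length(end_rel) == 0) return(character(0))
  lines[first:(first + end_rel[1] - 1)]
}

# Assumed markers: a header starting with "Nombre y D..." and a line
# starting with a five-digit zip code followed by a space
address_block <- extract_block(ocr_text, "^nombre y d", "^[0-9]{5} ")
print(address_block)
```

Anchoring on the header and the zip-code pattern, rather than on fixed line numbers, keeps the extraction working when the OCR output gains or loses lines above the block.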
