How to ignore special characters in Tesseract OCR using Java

Question

I have extracted text from an image with Tesseract OCR using Java, but the output contains some special characters because the image contains symbols.

I want to ignore all the special characters and display just the text. Is there any way I can do that?

Recommended answer

In tesseract you can set TessBaseAPI.VAR_CHAR_WHITELIST or TessBaseAPI.VAR_CHAR_BLACKLIST to make the engine ignore certain characters.

The following makes tesseract recognize only A-Z and digits:

// tessBaseApi is assumed to be an already initialized TessBaseAPI instance.
String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
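Typing a long whitelist by hand is easy to get wrong, so one option is to build the string from character ranges. The sketch below uses only the Java standard library; the class name `WhitelistBuilder` and its methods are hypothetical helpers made up for illustration, not part of any Tesseract wrapper.

```java
// Hypothetical helper (not part of Tesseract): builds a whitelist string
// from inclusive character ranges.
final class WhitelistBuilder {
    private WhitelistBuilder() {}

    /** Appends every character from first to last (inclusive) to sb. */
    static void appendRange(StringBuilder sb, char first, char last) {
        // An int counter avoids char overflow if last is Character.MAX_VALUE.
        for (int c = first; c <= last; c++) {
            sb.append((char) c);
        }
    }

    /** Returns "A".."Z" followed by "0".."9", the same set as the literal above. */
    static String upperCaseAndDigits() {
        StringBuilder sb = new StringBuilder();
        appendRange(sb, 'A', 'Z');
        appendRange(sb, '0', '9');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
        System.out.println(upperCaseAndDigits());
    }
}
```

The resulting string can be passed to the whitelist variable in place of the hand-written literal.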

The next snippet lets tesseract recognize everything except ~ and fl:

String blackList = "~fl";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, blackList);

Also note that, as mentioned in a tesseract GitHub issue, you can't blacklist or whitelist characters with the tesseract 4.0 alpha LSTM engine; instead, you should train the LSTM with the characters you expect in your images.

Of course, if you prefer, you can still use a 3.* version of tesseract; its tessdata is located here.
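If neither list can be applied (for example with the 4.0 alpha LSTM engine mentioned above), a different technique is to filter the recognized string after OCR has run. This is plain post-processing, not a Tesseract feature. The sketch below assumes "text" means Unicode letters, digits, and whitespace; the class `OcrTextCleaner` and its method are hypothetical names, and the sample string stands in for real OCR output.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper: runs on the string returned by the
// OCR engine, so it works regardless of the Tesseract version.
final class OcrTextCleaner {
    // Matches anything that is not a Unicode letter, a decimal digit, or whitespace.
    private static final Pattern NOT_TEXT = Pattern.compile("[^\\p{L}\\p{Nd}\\s]");
    // Matches runs of two or more spaces/tabs left behind by removed symbols.
    private static final Pattern SPACE_RUNS = Pattern.compile("[ \\t]{2,}");

    private OcrTextCleaner() {}

    /** Removes special characters, collapses leftover space runs, and trims. */
    static String stripSpecialCharacters(String ocrText) {
        String kept = NOT_TEXT.matcher(ocrText).replaceAll("");
        return SPACE_RUNS.matcher(kept).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for real OCR output; prints: Total 42 items
        System.out.println(stripSpecialCharacters("Total: ~ 42 # items!"));
    }
}
```

Note that this filter also drops ordinary punctuation such as periods and commas. If those should survive, add them to the character class in `NOT_TEXT`.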
