Do any Java OCR tools convert images of text into editable text files?


Question

I'm working on a project that involves photographing text (from any hard copy) and converting it into a text file. I'd then like to use that text file for a few different things, such as providing hyperlinks to news articles or letting the user edit the document.

The tool I've tried so far is Java OCR from sourceforge.net, which works fine on the images provided in the package. But when I photograph my own text, it doesn't work at all. Is there some training process I should be implementing? If so, does anybody know how to implement it? Any help would go a long way. Thank you!

Recommended answer

I have a Java application where I ended up deciding to use Tesseract OCR, and just call out to it using Runtime.exec(). Perhaps not quite the answer you need, but I mention it in case you hadn't considered it.


  • On a Windows installation, I think I was able to use an installer, or unzip a ready-made binary.
  • On a Linux server, I needed to compile Tesseract myself, but it's not too hard if you're used to that kind of thing (gcc); the only gotcha is that it depends on Leptonica, which also needs to be compiled.

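// Convert to TIFF first: the answer was written for a Tesseract version that only accepted .tif input.
// Newer Leptonica-based builds generally also read PNG and JPEG, so this step may not be needed.
// Note that ImageIO.write returns false when no TIFF writer is available (the JDK bundles one from Java 9 onward).
ImageIO.write(ImageIO.read(new java.io.File(file.getPath())), "tif", tmpFile[0]);

// Tesseract appends ".txt" to the output base name itself, so strip it from the target path.
// StringUtils.removeEnd is presumably the Apache Commons Lang helper.
String[] tesseractCmd = new String[]{"tesseract", tmpFile[0].getAbsolutePath(), StringUtils.removeEnd(tmpFile[1].getAbsolutePath(), ".txt")};
final Process process = Runtime.getRuntime().exec(tesseractCmd);
try {
    // Block until Tesseract finishes; exit code 0 means the text file was written.
    int exitValue = process.waitFor();
    if (exitValue == 0) {
        // Read the recognized text and close the reader afterwards.
        try (FileReader reader = new FileReader(tmpFile[1])) {
            return SearchableTextExtractionUtils.extractPlainText(reader);
        }
    }
    throw new SearchableTextExtractionException(exitValue, Arrays.toString(tesseractCmd));
} catch (InterruptedException e) {
    throw new SearchableTextExtractionException(e);
} finally {
    process.destroy();
}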
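
The snippet above depends on helper classes from my own application (SearchableTextExtractionUtils, SearchableTextExtractionException) that aren't shown. For anyone who wants something self-contained, here is a minimal sketch of the same idea using only the JDK. The class name TesseractRunner and its methods are made up for illustration, and it assumes a tesseract executable is on the PATH and that your build can read the image format you pass in.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library: shells out to the tesseract CLI.
final class TesseractRunner {

    // Builds "tesseract <image> <outputBase>"; Tesseract appends ".txt" to the base itself.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs Tesseract on the image and returns the recognized text.
    static String recognize(Path image) throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = Paths.get(outputBase + ".txt");
        try {
            // inheritIO() forwards Tesseract's console output to this process,
            // so the child cannot stall on a full, unread pipe.
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO()
                    .start();
            int exitValue = process.waitFor();
            if (exitValue != 0) {
                throw new IOException("tesseract exited with code " + exitValue);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary files whether or not OCR succeeded.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractRunner.java <image>");
            return;
        }
        System.out.println(recognize(Paths.get(args[0])));
    }
}
```

ProcessBuilder is used here instead of Runtime.exec() mainly because it makes it easy to deal with the child's output streams; otherwise the behaviour is the same as in my snippet.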

