Tess4j on Windows 64-bit: exception on multiple threads


Question

I am using Tesseract 3 with Java 8 on 64-bit Windows to OCR scanned PDFs. I followed the instructions on the Tess4j page, used the 64-bit versions of the required DLLs, and installed 64-bit Ghostscript.

When I run my unit test with a plain @Test (no arguments), the code runs correctly, so I assume everything is installed correctly.

When I run it with 2 threads in parallel (see below), I get an exception.

I have read the relevant thread here, which suggests using Tesseract1. That is what I am using, and I have tried both Tesseract and Tesseract1.

Any ideas?

Here is the code:

//  @Test // works
@Test(invocationCount = 2, threadPoolSize = 2)
public void testOcr() throws OcrException, TesseractException {
    File scannedPdf = new File(this.getClass().getClassLoader().getResource("scanned.pdf").getFile());
//  Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
    Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
    String str = instance.doOCR(scannedPdf);
    System.out.println("OCR Result: " + str);
}

Here is the exception:

log4j:WARN No appenders could be found for logger (org.ghost4j.Ghostscript).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Ιουλ 16, 2014 6:22:23 ΜΜ net.sourceforge.vietocr.PdfUtilities convertPdf2Png
SEVERE: Cannot initialize Ghostscript interpreter. Error code is -21
org.ghost4j.GhostscriptException: Cannot initialize Ghostscript interpreter. Error code is -21
    at org.ghost4j.Ghostscript.initialize(Ghostscript.java:365)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Png(Unknown Source)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Tiff(Unknown Source)
    at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

java.lang.Error: Invalid memory access
    at com.sun.jna.Native.invokeInt(Native Method)
    at com.sun.jna.Function.invoke(Function.java:383)
    at com.sun.jna.Function.invoke(Function.java:315)
    at com.sun.jna.Library$Handler.invoke(Library.java:212)
    at com.sun.proxy.$Proxy3.gsapi_init_with_args(Unknown Source)
    at org.ghost4j.Ghostscript.initialize(Ghostscript.java:350)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Png(Unknown Source)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Tiff(Unknown Source)
    at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
net.sourceforge.tess4j.TesseractException: javax.imageio.IIOException: I/O error reading header!
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.imageio.IIOException: I/O error reading header!
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.readHeader(TIFFImageReader.java:224)
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.locateImage(TIFFImageReader.java:231)
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.getNumImages(TIFFImageReader.java:279)
    at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
    ... 18 more
Caused by: java.io.EOFException
    at javax.imageio.stream.ImageInputStreamImpl.readShort(ImageInputStreamImpl.java:229)
    at javax.imageio.stream.ImageInputStreamImpl.readUnsignedShort(ImageInputStreamImpl.java:242)
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.readHeader(TIFFImageReader.java:199)
    ... 21 more

UPDATE: It seems to be related to this

Recommended answer

Tesseract on its own can only convert images to text. It cannot read PDFs, even scanned ones.

Under the hood, Tess4j uses Ghostscript (through ghost4j) to convert each page of the PDF to a single image file, which it then feeds to Tesseract for OCR. It concatenates the per-page results into a single string, which it returns.
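The flow described above can be sketched in plain Java. This is only a conceptual illustration, not Tess4j's actual code: the class PdfOcrPipeline and its two function parameters (one standing in for the Ghostscript rendering step, one for the Tesseract recognition step) are hypothetical names introduced here.

```java
import java.util.List;
import java.util.function.Function;

/**
 * Conceptual sketch of the PDF OCR flow: render each page to an image,
 * recognize each image, concatenate the text. Not Tess4j's real implementation.
 */
final class PdfOcrPipeline {
    // Hypothetical stand-in for the Ghostscript step: PDF path -> one image path per page.
    private final Function<String, List<String>> renderPages;
    // Hypothetical stand-in for the Tesseract step: image path -> recognized text.
    private final Function<String, String> recognize;

    PdfOcrPipeline(Function<String, List<String>> renderPages,
                   Function<String, String> recognize) {
        this.renderPages = renderPages;
        this.recognize = recognize;
    }

    /** Runs both steps and returns the text of all pages as one string. */
    String ocr(String pdfPath) {
        StringBuilder result = new StringBuilder();
        for (String pageImage : renderPages.apply(pdfPath)) {
            result.append(recognize.apply(pageImage));
        }
        return result.toString();
    }
}
```

The point of separating the two steps is that only the rendering step involves Ghostscript, and that is the step that fails in the stack trace in the question.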

The reason for the exception is that Tess4j uses ghost4j in a way that does not support multithreading. As described here, ghost4j does provide multithreading support through its high-level API (in practice, it runs separate Ghostscript instances, each invoked from a different JVM). Tess4j, however, uses the low-level API, where only a single Ghostscript instance can be used.
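One possible workaround, given as a minimal sketch and not as something the answer itself prescribes, is to make sure only one thread at a time enters the call that reaches Ghostscript. The names GhostscriptGuard, OcrTask and runExclusively below are hypothetical. The sketch assumes that the single Ghostscript instance is shared by all threads in the JVM, and that serializing the whole doOCR call is acceptable for your throughput.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical helper that serializes access to code which ends up
 * in the single, non-thread-safe Ghostscript instance.
 */
final class GhostscriptGuard {
    // One lock for the whole JVM (assumption: the Ghostscript instance is shared by all threads).
    private static final ReentrantLock LOCK = new ReentrantLock();

    /** A task that returns a value and may throw a checked exception of type E. */
    interface OcrTask<T, E extends Exception> {
        T run() throws E;
    }

    private GhostscriptGuard() {
    }

    /** Runs the task while holding the lock, so at most one task runs at any time. */
    static <T, E extends Exception> T runExclusively(OcrTask<T, E> task) throws E {
        LOCK.lock();
        try {
            return task.run();
        } finally {
            // Always release the lock, even if the task throws.
            LOCK.unlock();
        }
    }
}
```

In the test from the question, the call would then become String str = GhostscriptGuard.runExclusively(() -> instance.doOCR(scannedPdf)); which still declares TesseractException. This removes the concurrent access but also removes the parallelism for PDF input, so it mainly helps when the threads do other work besides OCR.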
