How to extract images from a file using Apache Tika?


Question

I have a PDF (or another type of file, such as .doc or .ppt) that contains text as well as images. How can I extract the images from such files using Tika?

Also, can I run OCR on the extracted images using Tess4j or another library?

This is how I call Tika:

 AutoDetectParser parser = new AutoDetectParser();
 BodyContentHandler handler = new BodyContentHandler(writeLimit);
 Metadata metadata = new Metadata();        
 InputStream stream = new FileInputStream("file.pdf");      
 parser.parse(stream, handler, metadata);   

P.S. I have tika-app.jar.

Recommended answer

The method:

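        import java.io.FileInputStream;
        import java.io.InputStream;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.parser.Parser;
        import org.apache.tika.parser.ocr.TesseractOCRConfig;
        import org.apache.tika.parser.pdf.PDFParserConfig;
        import org.apache.tika.sax.BodyContentHandler;

        Parser parser = new AutoDetectParser();
        // Integer.MAX_VALUE effectively lifts the default 100,000-character write limit.
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        // Hand images embedded in PDF pages to the embedded-document (OCR) parser;
        // this is off by default.
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        // Registering the parser itself is what makes recursive parsing
        // of embedded documents happen.
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(inputFile)) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        // Body text, including any text Tesseract recognized in images.
        String text = handler.toString().trim();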
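
The code above returns only text (including OCR output), not the image files themselves. If the goal is to save the embedded images to disk, one possible approach is to register a custom EmbeddedDocumentExtractor in the ParseContext. The following is a minimal sketch, assuming the Tika 1.13 API from the dependency list in this answer; the class name ImageExtractorSketch, the output directory "extracted-images", and the helper safeFileName are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ImageExtractorSketch {

    // Hypothetical helper: keeps only the last path segment, replaces unusual
    // characters, and prefixes the running index so names never collide.
    static String safeFileName(String name, int index) {
        String base = (name == null) ? "" : name.replace('\\', '/');
        base = base.substring(base.lastIndexOf('/') + 1)
                   .replaceAll("[^A-Za-z0-9._-]", "_");
        return base.isEmpty() ? "embedded-" + index + ".bin" : index + "-" + base;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the path of the document to process.
        final Path outDir = Files.createDirectories(Paths.get("extracted-images"));
        final int[] counter = {0};

        EmbeddedDocumentExtractor extractor = new EmbeddedDocumentExtractor() {
            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true; // let every embedded resource reach parseEmbedded
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                                      Metadata metadata, boolean outputHtml)
                    throws IOException {
                String type = metadata.get(Metadata.CONTENT_TYPE);
                // Skip resources that declare a non-image type; resources
                // without a declared type are saved as well.
                if (type != null && !type.startsWith("image/")) {
                    return;
                }
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                Path target = outDir.resolve(safeFileName(name, counter[0]++));
                Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            }
        };

        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true); // needed for images inside PDF pages

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(EmbeddedDocumentExtractor.class, extractor);

        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, new BodyContentHandler(-1),
                                         new Metadata(), context);
        }
        System.out.println("Saved " + counter[0] + " file(s) to " + outDir);
    }
}
```

The saved files could then be passed to an OCR library such as Tess4j in a separate step. Note that in this sketch the custom extractor takes the place of Tika's default recursive parsing, so embedded resources are written to disk rather than parsed for text.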


1) Make sure Tesseract is installed. On Windows, use 'tesseract-ocr-setup-3.05.00dev.exe' from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and add its installation directory (under Program Files by default) to the PATH environment variable. Restart Windows if needed. After that you can pass any file (yes, any!) and the text will be extracted.

2) Download tess4j-3.0.0.jar from https://sourceforge.net/projects/tess4j/?source=typ_redirect and reference it with this dependency:

    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.0.0</version>
    </dependency>

Then add these:

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.13</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>

    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>

    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.java.dev.jna/jna -->
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>4.2.2</version>
    </dependency>

    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.11</version>
    </dependency>

On Ubuntu, install Tesseract with apt-get instead (the package is named tesseract-ocr); that works as well.
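
Because the OCR step depends on the tesseract executable being reachable through PATH, it can help to verify that before running the parser. The following is a small sketch using only the Java standard library; the class name TesseractPathCheck and the method isOnPath are hypothetical names, and the check only looks for a file called tesseract or tesseract.exe in each PATH entry.

```java
import java.io.File;

public class TesseractPathCheck {

    // Returns true if a regular file named `executable` (or `executable`.exe)
    // exists in one of the directories listed in pathValue.
    static boolean isOnPath(String executable, String pathValue) {
        if (pathValue == null) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // ignore empty PATH entries
            }
            if (new File(dir, executable).isFile()
                    || new File(dir, executable + ".exe").isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean found = isOnPath("tesseract", System.getenv("PATH"));
        System.out.println(found
                ? "tesseract found on PATH"
                : "tesseract NOT found on PATH; OCR may be skipped");
    }
}
```

If the check fails on Windows right after installation, the PATH change may not be visible to the current process yet, which is consistent with the advice above to restart.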
