How to extract images from a file using Apache Tika?

Question

I have a PDF (or another type of file, such as .doc or .ppt) that contains text as well as images. How can I extract the images from such files using Tika?

Can I also run OCR on the extracted images using Tess4j or any other library?

This is how I call Tika:

 AutoDetectParser parser = new AutoDetectParser();
 BodyContentHandler handler = new BodyContentHandler(writeLimit);
 Metadata metadata = new Metadata();        
 InputStream stream = new FileInputStream("file.pdf");      
 parser.parse(stream, handler, metadata);   

P.S. I have tika-app.jar.

Recommended answer

Here is one way to do it:

        InputStream stream = new FileInputStream(inputFile);

        Parser parser = new AutoDetectParser();
        // Raise the write limit (100,000 characters by default) so long
        // documents are not cut off
        BodyContentHandler handler = new BodyContentHandler(
                Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        // Inline PDF images are skipped by default; enable them so they
        // are passed on to OCR
        pdfConfig.setExtractInlineImages(true);
        ParseContext parseContext = new ParseContext();

        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        // Registering the parser in the context is needed to make sure
        // recursive parsing of embedded documents (the images) happens
        parseContext.set(Parser.class, parser);
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata, parseContext);
        String text = handler.toString().trim();
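
The snippet above returns the extracted text (including any OCR text) but does not write the images themselves anywhere. If the image files are needed, one possible approach is to hand each embedded stream to a small helper that saves it to disk. The sketch below shows only that saving half, in plain Java so it runs without Tika on the classpath. `ImageSink`, its method names, and the `image-N` file naming are hypothetical and not part of Tika; the wiring into Tika described in the comment is an assumption based on the Tika 1.x `EmbeddedDocumentExtractor` interface and is not compiled here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper (not part of Tika): writes each embedded image
 * stream to its own file in an output directory.
 *
 * Assumed wiring (Tika 1.x, not compiled here): implement
 * org.apache.tika.extractor.EmbeddedDocumentExtractor and, inside
 * parseEmbedded(stream, handler, metadata, outputHtml), call
 *     sink.save(stream, metadata.get(Metadata.CONTENT_TYPE));
 * then register it with
 *     parseContext.set(EmbeddedDocumentExtractor.class, extractor);
 */
public class ImageSink {
    private final Path outputDir;
    private final List<Path> saved = new ArrayList<>();

    public ImageSink(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** True for MIME types such as image/png or image/jpeg. */
    public static boolean isImage(String contentType) {
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("image/");
    }

    /** Maps a few common image MIME types to an extension; falls back to ".bin". */
    public static String extensionFor(String contentType) {
        String type = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            type = type.substring(0, semicolon); // drop parameters such as "; q=1"
        }
        switch (type.trim()) {
            case "image/png":  return ".png";
            case "image/jpeg": return ".jpg";
            case "image/gif":  return ".gif";
            case "image/tiff": return ".tif";
            case "image/bmp":  return ".bmp";
            default:           return ".bin";
        }
    }

    /** Copies the stream to outputDir/image-N.ext and returns the new path. */
    public Path save(InputStream stream, String contentType) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(
                "image-" + saved.size() + extensionFor(contentType));
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        saved.add(target);
        return target;
    }

    /** Paths of all files written so far, in the order they were saved. */
    public List<Path> savedFiles() {
        return saved;
    }

    public static void main(String[] args) throws IOException {
        ImageSink sink = new ImageSink(Files.createTempDirectory("tika-images"));
        byte[] fakeImage = {1, 2, 3}; // stand-in for a real embedded image stream
        if (isImage("image/png")) {
            System.out.println("Saved " + sink.save(new ByteArrayInputStream(fakeImage), "image/png"));
        }
    }
}
```

One design point to keep in mind: registering a custom extractor in the `ParseContext` takes the place of Tika's default one, so an implementation that should both save the files and still produce OCR text would likely need to buffer the stream and also pass it on to the parser. The content type is only available when the parser fills it in, which is why the helper falls back to `.bin`.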

1) Make sure Tesseract is installed. On Windows, use 'tesseract-ocr-setup-3.05.00dev.exe' from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and add its install directory (it is installed under Program Files) to the PATH environment variable. Restart Windows if needed. You can then pass in any file (yes, any!) and it will be extracted.

2) Download tess4j-3.0.0.jar from https://sourceforge.net/projects/tess4j/?source=typ_redirect and reference it with:

    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.0.0</version>
    </dependency>

Then add these:

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.13</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>

    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>

    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.java.dev.jna/jna -->
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>4.2.2</version>
    </dependency>

    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.11</version>
    </dependency>

However, on Ubuntu, Tesseract should be installed with apt-get instead. It will then work.
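
Because the OCR step depends on the `tesseract` executable being reachable through PATH, a quick check before parsing can save some debugging. The following is a minimal sketch in plain Java; `TesseractLocator` and `find` are hypothetical names introduced for illustration, and the check only looks for a file with the expected name, not whether that binary actually works.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for an executable in a PATH-style string. */
public class TesseractLocator {

    /**
     * Searches each directory in pathValue (formatted like the PATH
     * environment variable) for an executable file with one of the names.
     */
    public static Optional<Path> find(String pathValue, String... names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> exe = find(System.getenv("PATH"), "tesseract", "tesseract.exe");
        System.out.println(exe.map(p -> "Found tesseract at " + p)
                .orElse("tesseract was not found on PATH"));
    }
}
```

Running `main` after installing Tesseract (and, on Windows, after updating PATH and reopening the terminal) should report the location of the executable if the setup is as described above.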
