Extract Images from PDF with Apache Tika

Question

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.

My use case is that I want some code that will extract the text content and, separately, the images from any document (not necessarily a PDF). The result then gets passed into an Apache UIMA pipeline.

I've been able to extract images from other document types by using a custom parser (built on an AutoParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.

Could someone suggest how I might achieve the above, preferably with some code examples of how to do inline image extraction from PDFs with Tika 1.6?

Recommended answer

It is possible to use an AutoDetectParser to extract images, without relying on PDFParser directly. This code works just as well for extracting images from docx, pptx, and other formats.

Here I have a parseDocument() and a setPdfConfig() function, which together make use of an AutoDetectParser.

  1. I create an AutoDetectParser.
  2. Attach an EmbeddedDocumentExtractor onto a ParseContext.
  3. Attach the AutoDetectParser onto the same ParseContext.
  4. Attach a PDFParserConfig onto the same ParseContext.
  5. Then give that ParseContext to AutoDetectParser.parse().

The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.
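The naming rule above can be sketched on its own, without Tika, using only java.nio. This is a minimal sketch, not part of the original answer: the class EmbeddedOutputPaths, its two methods, and the fallback name "embedded-<index>.bin" are hypothetical. It also keeps only the last path segment of the reported resource name, an extra precaution (an assumption about untrusted input, not something the answer requires) so that a name like "../x.png" cannot land outside the output folder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class EmbeddedOutputPaths {

    /** Returns the folder "<sourceFile>_" that sits next to the source file. */
    public static Path outputDirFor(String sourcePath) {
        return Paths.get(sourcePath + "_");
    }

    /**
     * Resolves the target file for one embedded resource inside outputDir.
     * Only the last path segment of resourceName is kept. A null, empty,
     * "." or ".." name falls back to the hypothetical "embedded-<index>.bin".
     */
    public static Path targetFor(Path outputDir, String resourceName, int index) {
        String name = null;
        if (resourceName != null && !resourceName.isEmpty()) {
            // Treat backslashes as separators too, so Windows-style names are split.
            Path fileName = Paths.get(resourceName.replace('\\', '/')).getFileName();
            if (fileName != null) {
                name = fileName.toString();
            }
        }
        if (name == null || name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "embedded-" + index + ".bin";
        }
        return outputDir.resolve(name);
    }

    public static void main(String[] args) {
        Path dir = outputDirFor("/data/report.pdf");
        System.out.println(dir);
        System.out.println(targetFor(dir, "image0.jpg", 0));
        System.out.println(targetFor(dir, "../../etc/passwd", 1));
        System.out.println(targetFor(dir, null, 2));
    }
}
```

A helper like targetFor() could be called from an extractor's parseEmbedded() with the resource name taken from the embedded document's metadata.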

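// Imports needed by setPdfConfig() and parseDocument().
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Registers a PDF configuration on the context so that the PDF parser
// extracts inline images, emitting each distinct image only once.
private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    context.set(PDFParserConfig.class, pdfConfig);
}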

// Parses the file at `path`, writes every embedded resource (such as an
// inline image) into the folder "<path>_", and returns the document as XHTML.
private static String parseDocument(String path) {
    String xhtmlContents = "";

    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    EmbeddedDocumentExtractor embeddedDocumentExtractor =
            new EmbeddedDocumentExtractor() {
        // Counter for resources that arrive without a name.
        private int unnamed = 0;

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            // Accept every embedded resource.
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);

            // The resource name can be missing; the fallback name used here
            // is an arbitrary choice to avoid writing a file called "null".
            String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
            if (name == null) {
                name = "embedded-" + (unnamed++);
            }

            // Overwrite any file left over from a previous run.
            Path outputPath = outputDir.resolve(name);
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    context.set(AutoDetectParser.class, parser);

    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }

    return xhtmlContents;
}
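
To check whether image references actually made it into the XHTML that parseDocument() returns, the string can be scanned for img elements. The sketch below is not part of the original answer. It assumes that Tika writes embedded images as img elements whose src attribute starts with "embedded:"; verify this against the output of your Tika version. The class name EmbeddedImageRefs and the sample XHTML are hypothetical, and only the JDK's built-in XML parser is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EmbeddedImageRefs {

    // Assumed prefix of src attributes that point at embedded resources.
    private static final String PREFIX = "embedded:";

    /** Returns the resource names referenced by img elements in the XHTML. */
    public static List<String> find(String xhtml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XHTML elements live in a namespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<String> names = new ArrayList<>();
        // "*" matches img elements regardless of their namespace.
        NodeList imgs = doc.getElementsByTagNameNS("*", "img");
        for (int i = 0; i < imgs.getLength(); i++) {
            String src = ((Element) imgs.item(i)).getAttribute("src");
            if (src.startsWith(PREFIX)) {
                names.add(src.substring(PREFIX.length()));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hand-written sample standing in for real parser output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>Some text</p>"
                + "<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\"/>"
                + "<img src=\"embedded:image1.png\" alt=\"image1.png\"/>"
                + "</body></html>";
        System.out.println(find(sample));
    }
}
```

An empty list for a PDF that visibly contains images would suggest that inline image extraction was not enabled on the ParseContext used for parsing.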
