Extract Images from PDF with Apache Tika

Question

Apache Tika 1.6 can extract inline images from PDF documents, but I have been struggling to get it to work.

My use case is that I want some code that extracts the content and, separately, the images from any document (not necessarily a PDF). The result is then passed into an Apache UIMA pipeline.

I've been able to extract images from other document types by using a custom parser (built on AutoDetectParser) to convert the documents to HTML and then save the images out separately. When I try this with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.

Could someone suggest how I might achieve this, preferably with a code example showing inline image extraction from PDFs with Tika 1.6?

Recommended answer

Try the code below. The returned ContentHandler holds your XML content.

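import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Parses the PDF bytes and writes each inline image into the directory `path`.
// Note: `filename` is not used by this method.
public ContentHandler convertPdf(byte[] content, final String path, String filename)
        throws IOException, SAXException, TikaException {

    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();
    PDFParser parser = new PDFParser();

    // Inline image extraction is off by default and must be enabled explicitly.
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);
    parser.setPDFParserConfig(config);

    // Called once per embedded resource; saves the stream instead of parsing it.
    EmbeddedDocumentExtractor embeddedDocumentExtractor = new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml) throws SAXException, IOException {
            // Resolve against the directory so a missing trailing separator does not matter.
            Path outputFile = Paths.get(path).resolve(metadata.get(Metadata.RESOURCE_NAME_KEY));
            // Overwrite on re-runs instead of failing with FileAlreadyExistsException.
            Files.copy(stream, outputFile, StandardCopyOption.REPLACE_EXISTING);
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }

    return handler;
}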
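
One thing to watch in parseEmbedded is the file name itself: the value under Metadata.RESOURCE_NAME_KEY comes from the document, so it may be missing or may contain path separators. The following is only a sketch of how the target file could be chosen more defensively. EmbeddedImageNames, safeName, resolveTarget and the "embedded-N.bin" fallback are hypothetical names made up for this illustration; they are not part of Tika, and the code uses only the JDK.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a Tika class): picks a safe file name for an embedded resource.
final class EmbeddedImageNames {

    // Returns the last path segment of resourceName, or a generated
    // fallback name when nothing usable is left.
    static String safeName(String resourceName, int index) {
        String name = resourceName == null ? "" : resourceName.trim();
        // Drop any directory part so the name cannot point outside the output directory.
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(cut + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            // Assumed fallback pattern; choose whatever suits your pipeline.
            return "embedded-" + index + ".bin";
        }
        return name;
    }

    // Resolves the sanitized name against the output directory.
    static Path resolveTarget(Path outputDir, String resourceName, int index) {
        return outputDir.resolve(safeName(resourceName, index));
    }
}
```

Inside parseEmbedded, the output file could then be computed as EmbeddedImageNames.resolveTarget(Paths.get(path), metadata.get(Metadata.RESOURCE_NAME_KEY), counter), where counter is a field you increment for each embedded resource.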
