Remove PDFont caching with Apache Tika

Views: 38 | Published: 2021/11/14 23:45:05 | Tags: pdfbox, apache-tika


Problem description

I am trying to extract only the text from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can auto-detect the document type and extract the text accordingly. I am only interested in the text, not the formatting.

My application ends up with a large memory leak. On investigation, it comes from caching in the PDFont class of the PDFBox dependency. I am not interested in caching font metrics or other font formatting data from PDFs, since I only want to extract the text.

I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using auto-detection:

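        // Imports needed: java.io.File, java.io.FileInputStream, java.io.InputStream,
        // org.apache.tika.metadata.Metadata, org.apache.tika.parser.AutoDetectParser,
        // org.apache.tika.parser.ParseContext, org.apache.tika.sax.BodyContentHandler,
        // org.apache.pdfbox.pdmodel.font.PDFont
        AutoDetectParser parser = new AutoDetectParser();

        // -1 disables the handler's write limit, so the full text is kept
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        String s;
        // try-with-resources closes the stream even if parsing throws
        try (InputStream inputstream = new FileInputStream(new File(child.getPath()))) {
            parser.parse(inputstream, handler, metadata, context);
            s = handler.toString();
        }
        // My attempt to clear PDFBox's static font cache after each document
        PDFont.clearResources();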
