Remove PDFont caching with Apache Tika
Problem description
I am trying to extract only the text from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can autodetect the document type and extract the text accordingly. I am only interested in the text, not in formatting and the like.
My application ends up with a large memory leak. On investigation, it comes from caching in the PDFont class of the PDFBox dependency. I have no use for cached font metrics or other font formatting data from the PDFs, since I only want to extract the text.
I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using autodetection:
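import java.io.File;
import java.io.FileInputStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// child: the file currently being processed (defined elsewhere)
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
String s;
// try-with-resources closes the stream even if parsing throws
try (FileInputStream inputstream = new FileInputStream(new File(child.getPath()))) {
    parser.parse(inputstream, handler, metadata, context);
    s = handler.toString();
}
// Drop references in the hope that the parsed data becomes collectable
handler = null;
context = null;
PDFont.clearResources(); // attempt to clear PDFBox's static font resources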
Recommended answer
So I fudged a workaround and just called System.gc() every time a file had finished being processed. This works a treat, but it doesn't really answer the question.
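The workaround above could be sketched roughly as follows. This is a minimal illustration rather than the poster's actual code: the class name GcAfterEachFile, the method extractAll, and the stand-in extractor function are all hypothetical, and in the real application the extractor would wrap the Tika parse call from the question. Note that System.gc() is only a request to the JVM, so whether and when memory is reclaimed is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class GcAfterEachFile {

    /**
     * Applies the given extractor to each path and requests a garbage
     * collection after every file. The extractor is a hypothetical
     * placeholder for the real text-extraction call.
     */
    static List<String> extractAll(List<String> paths,
                                   Function<String, String> extractor) {
        List<String> texts = new ArrayList<>();
        for (String path : paths) {
            texts.add(extractor.apply(path));
            // Only a hint: the JVM may ignore or delay the collection.
            System.gc();
        }
        return texts;
    }

    public static void main(String[] args) {
        // Stand-in extractor so the sketch runs without Tika on the classpath.
        List<String> texts = extractAll(
                Arrays.asList("a.pdf", "b.rtf"),
                path -> "text of " + path);
        System.out.println(texts);
    }
}
```

Forcing a collection per file trades throughput for a smaller heap, which is why the answer calls it a workaround rather than a fix for the underlying caching.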