Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

Question

I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megs in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently.

At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way.

I have seen many source code examples for loading files into Tika / POI / PDFBox via input streams. I have seen many examples for extracting plain text via output streams. However, I've performed some basic memory profiling experiments... and I haven't yet found a way with any of these libraries (Tika, POI, or PDFBox) to avoid loading an entire document into main memory.

In between reading from a stream and writing to a stream, there is obviously a conversion step in the middle... which I have not yet found a way to perform on a streaming basis. Am I missing something, or is this a known issue with extracting text from MS Office or PDF files using Tika / POI / PDFBox? Can I have true end-to-end streaming, without a file being fully loaded into main memory at any point along the way?

Recommended answer

The first thing to make sure of, if you care about the memory footprint, is that you're using a TikaInputStream backed by a File, e.g. change from something like

InputStream input = new FileInputStream("foo.xls");

to something like

InputStream input = TikaInputStream.get(new File("foo.xls"));

If you really only have an InputStream, not a file, and you want the lower memory option if possible, force Tika to buffer it to a temp file with something like

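InputStream origInput = getAnInputStream();
// Wrap the plain stream so Tika can manage a backing file for it
TikaInputStream input = TikaInputStream.get(origInput);
// Calling getFile() spools the stream to a temp file that parsers can read from
input.getFile();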

Many, but not all, parsers will be able to take advantage of the backing File and read only the bits they need into memory, rather than buffering the whole thing, which will help.
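To illustrate why spooling to a temp file keeps memory use low, here is a minimal JDK-only sketch of the general idea. It is not Tika's implementation; the class name `TempFileSpool` and the method `spool` are hypothetical names chosen for this example. The stream is copied to disk through a small buffer, so the heap never holds the whole document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of Tika): spools a stream to a temp file. */
class TempFileSpool {

    /**
     * Copies the stream to a new temp file and returns its path.
     * The caller is responsible for deleting the file afterwards.
     */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        // createTempFile already created the file, so REPLACE_EXISTING is required
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // A small in-memory stream stands in for a large network or upload stream
        InputStream source =
                new ByteArrayInputStream("example bytes".getBytes(StandardCharsets.UTF_8));
        Path spooled = spool(source);
        try {
            System.out.println("Spooled " + Files.size(spooled) + " bytes to " + spooled);
        } finally {
            Files.deleteIfExists(spooled);
        }
    }
}
```

A parser that is handed the resulting file can seek within it and read only the parts it needs, which is the benefit described above.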

Next up, make sure your ContentHandler doesn't buffer the whole contents into memory before outputting. Anything which does XPath lookups on the resulting document is probably out, as is anything which has an internal StringBuffer or similar. Pick a simpler one, and make sure you're set up to write the resulting html / text sax events somewhere as they come in.
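As a rough sketch of a handler that writes text as the events come in, the example below uses only the JDK's SAX classes. `StreamingTextHandler` is a hypothetical name, and the JDK SAX parser stands in for a Tika parser here purely so the example is self-contained; with Tika, a handler like this would be passed to the parser's `parse` call instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards each chunk of text to a Writer as it arrives. */
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // No internal buffer: the text goes straight to the destination
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Failed to write extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush extracted text", e);
        }
    }
}

class StreamingTextHandlerDemo {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p><p>world</p></body></html>";
        // StringWriter is only a stand-in; in practice use a file or socket writer
        StringWriter sink = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new StreamingTextHandler(sink));
        System.out.println(sink);
    }
}
```

Because the handler holds nothing but a reference to the `Writer`, its memory use stays constant regardless of how much text the parser emits.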

Finally, not all of the Tika parsers support streaming processing. Some only work by parsing the whole file's structure, then wandering through that finding the interesting bits to output. With those, using a File backed TikaInputStream will probably help, but won't stop a fair bit of memory being used.

IIRC, the low memory parsers include:

  • .xls
  • .xlsx
  • All ODF-based formats
  • XML

Some of the common document parsers which load + parse most/all of the file before being able to output anything include:

  • .doc/.docx/.ppt/.pptx
  • .pdf
  • Images
  • Videos
