Using POI or Tika to extract text, stream-to-stream, without loading the entire file in memory

Question

I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megabytes in some cases). My application is also multi-threaded, so I will be parsing many of these large files concurrently.

At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way.

I have seen many source code examples for loading files into Tika / POI / PDFBox via input streams, and many examples for extracting plain text via output streams. However, I've performed some basic memory profiling experiments, and I haven't yet found a way with any of these libraries (Tika, POI, or PDFBox) to avoid loading an entire document into main memory.

In between reading from a stream and writing to a stream, there is obviously a conversion step in the middle, which I have not yet found a way to perform on a streaming basis. Am I missing something, or is this a known issue with extracting text from MS Office or PDF files using Tika / POI / PDFBox? Can I have true end-to-end streaming, without a file being fully loaded into main memory at any point along the way?

Recommended answer

The first thing to make sure of, if you care about the memory footprint, is that you're using a TikaInputStream backed by a File, e.g. change from something like

InputStream input = new FileInputStream("foo.xls");

to something like

InputStream input = TikaInputStream.get(new File("foo.xls"));

If you really only have an InputStream, not a File, and you want the lower-memory option where possible, force Tika to buffer it to a temp file with something like

InputStream origInput = getAnInputStream();
TikaInputStream input = TikaInputStream.get(origInput);
input.getFile();
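
To make the memory argument concrete, here is a minimal JDK-only sketch of what spooling a stream to a temp file amounts to. It is not Tika's implementation; the class name TempFileSpool and the method spool are hypothetical names chosen for illustration. The point is only that the copy runs through a small fixed-size buffer, so heap use does not grow with the size of the input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika, POI, or PDFBox.
final class TempFileSpool {
    private TempFileSpool() {}

    /**
     * Copies the stream into a newly created temp file and returns its path.
     * Files.copy moves the data in small chunks, so the whole input is never
     * held in memory at once. The caller is responsible for deleting the file.
     */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Don't leave a partial temp file behind on failure.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }
}
```

The resulting file could then be wrapped in a File-backed TikaInputStream as in the earlier snippet. In practice, calling getFile() on the TikaInputStream already does this for you, so a helper like this is mainly useful if you want to control where the temp file lives or when it is deleted.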

Many, but not all, parsers will be able to take advantage of the backing File and read only the bits they need into memory, rather than buffering the whole thing, which will help.

Next up, make sure your ContentHandler doesn't buffer the whole contents into memory before outputting. Anything which does XPath lookups on the resulting document is probably out, as is anything which has an internal StringBuffer or similar. Pick a simpler one, and make sure you're set up to write the resulting HTML / text SAX events somewhere as they come in.
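
As a rough sketch of what a non-buffering handler can look like, the class below forwards each SAX characters event straight to a Writer and keeps no document text of its own. It uses only the JDK's org.xml.sax classes; the name StreamingTextHandler is hypothetical. Tika's parsers emit events to an org.xml.sax.ContentHandler, so a handler of this shape should be usable there, but treat that wiring as an assumption to check against the Tika version you run.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: writes text events out as they arrive.
final class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Forward the chunk immediately; nothing is accumulated here.
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Failed to write text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            // Push out anything the Writer itself still holds.
            out.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush output", e);
        }
    }
}
```

Give it a Writer that itself streams, such as a BufferedWriter over a file or socket; a StringWriter would reintroduce exactly the in-memory buffering you are trying to avoid. Tika's own BodyContentHandler can also be constructed around a Writer or OutputStream, which may be enough without a custom class.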

Finally, not all of the Tika parsers support streaming processing. Some only work by parsing the whole file's structure, then wandering through that to find the interesting bits to output. With those, using a File-backed TikaInputStream will probably help, but won't stop a fair bit of memory being used.

IIRC, the low-memory parsers include:

  • .xls
  • .xlsx
  • All of the ODF-based formats
  • XML

Some of the common document parsers which load and parse most or all of the file before being able to output anything include:

  • .doc/.docx/.ppt/.pptx
  • .pdf
  • Images
  • Video
