How to read large files using Tika?


Problem description

I'm parsing large PDF and Word documents using Tika, but I get the following error message.

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

How do I increase the limit?

Recommended answer

Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the javadocs.

Your code would then look something like this (inspired by the example):

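import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// A write limit of -1 disables the default 100,000-character cap.
BodyContentHandler handler = new BodyContentHandler(-1);

AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
// try-with-resources closes the stream even if parsing throws.
try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
    parser.parse(stream, handler, metadata);
    return handler.toString();
}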
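
To make the write-limit behaviour concrete, here is a minimal, dependency-free sketch of how such a limit might behave. It is only a model of the semantics described in the error message, not Tika's actual implementation: the names WriteLimitSketch, LimitedSink, LimitReachedException and extract are hypothetical and exist purely for illustration. The sketch assumes that a limit of -1 means "unlimited" and that text collected before the limit was hit stays available.

```java
// Hypothetical stand-in for a write-limited text sink; NOT Tika's actual code.
public class WriteLimitSketch {

    /** Thrown when more characters arrive than the limit allows. */
    static class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Document contained more than " + limit + " characters");
        }
    }

    /** Collects text up to writeLimit characters; -1 means no limit. */
    static class LimitedSink {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit;

        LimitedSink(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        void characters(String chunk) {
            if (writeLimit == -1) {
                // Unlimited: accept everything.
                text.append(chunk);
                return;
            }
            int remaining = writeLimit - text.length();
            if (chunk.length() > remaining) {
                // Keep what still fits, then signal that the limit was hit.
                text.append(chunk, 0, remaining);
                throw new LimitReachedException(writeLimit);
            }
            text.append(chunk);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds chunks into a sink and returns whatever text was collected. */
    static String extract(int writeLimit, String... chunks) {
        LimitedSink sink = new LimitedSink(writeLimit);
        try {
            for (String chunk : chunks) {
                sink.characters(chunk);
            }
        } catch (LimitReachedException e) {
            // Text up to the limit is still available in the sink.
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(5, "abc", "defgh").length());
        System.out.println(extract(-1, "abc", "defgh").length());
    }
}
```

With a limit of 5 the sketch keeps only the first five characters, while -1 lets all eight through. Note that disabling the limit means the whole extracted text is held in memory, so very large documents may call for a larger heap.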
