How to read large files using Tika?


Question

I'm parsing large PDF and Word documents using Tika, but I get the following error message:

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

How do I increase the limit?

Recommended answer

Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the javadocs.

Your code would then look something like this (inspired by the example):

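import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// A write limit of -1 disables the default 100,000-character cap.
BodyContentHandler handler = new BodyContentHandler(-1);
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
// try-with-resources closes the stream even if parsing fails.
try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
    parser.parse(stream, handler, metadata);
    return handler.toString();
}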
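
To make the meaning of the write limit concrete, here is a minimal sketch of how a write-limited SAX handler could behave. It uses only the JDK's SAX classes and is not Tika's actual implementation: LimitedTextHandler, feed and WriteLimitSketch are hypothetical names invented for illustration. The assumption is simply that the handler keeps text up to the limit and then signals an error, and that -1 means "no limit", which matches the behaviour described in the error message above.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical stand-in for a write-limited text handler; NOT Tika's class. */
    public static class LimitedTextHandler extends DefaultHandler {
        private final int writeLimit; // -1 means "no limit"
        private final StringBuilder text = new StringBuilder();

        public LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
                return;
            }
            // Keep the text up to the limit, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the same chunk to the handler several times; returns true if the limit was hit. */
    public static boolean feed(LimitedTextHandler handler, String chunk, int chunks) {
        char[] data = chunk.toCharArray();
        try {
            for (int i = 0; i < chunks; i++) {
                handler.characters(data, 0, data.length);
            }
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        LimitedTextHandler capped = new LimitedTextHandler(10);
        LimitedTextHandler unlimited = new LimitedTextHandler(-1);
        // 5 chunks of 4 characters = 20 characters in total.
        System.out.println(feed(capped, "abcd", 5) + " " + capped.toString().length());       // true 10
        System.out.println(feed(unlimited, "abcd", 5) + " " + unlimited.toString().length()); // false 20
    }
}
```

With the cap in place the handler stops at the limit but the text collected so far remains available, which is why the error message says "Text up to the limit is however available". Keep in mind that disabling the limit means the whole extracted text is held in memory, so very large documents need a correspondingly large heap.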
