How to properly configure Apache Tika for a few document types?


Question

I've been using Tika for a while, and I know that one is supposed to use only the Tika facade, with either the default or a custom TikaConfig that represents the org/apache/tika/mime/tika-mimetypes.xml file.

My application doesn't allow any document type other than html, doc, docx, odt, txt, rtf, srt, sub, pdf, odf, odp, xls, ppt and msg, and the default set of MediaTypes includes tons of others.

Are we supposed to modify tika-mimetypes.xml to remove the MimeTypes we don't need? As I understand it, Tika would then create composite parsers and detectors only for the remaining MimeTypes.

But what happens when it is given an unsupported type? Should I just catch TikaException or some SAXException and reject the file?

Also, how is one supposed to edit tika-mimetypes.xml by hand? It has 1290 MimeTypes, most of them ridiculous third-party ones. Why are they there?

Recommended answer

If you only want to accept certain types, you'll still want the full set of mimetypes. Otherwise, how can you detect that the file someone has just given you is in fact an MP3, and not one of your approved formats? So keep the full mimetypes set for detection.

Once you've done the detection step and decided the mimetype is a valid one, you can just pass the file on to the AutoDetectParser and be done with it. After all, you will already have checked the mimetype returned by the detector and bailed out if it isn't one you accept.
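The detect-then-check step can be sketched as a plain allow-list lookup on the detected media type. This is only an illustration under stated assumptions: the class name `AllowedTypes` and the exact list of media-type strings are mine, not from the question, and the detected string is expected to come from something like `new Tika().detect(file)` with the full mimetypes set in place. The srt and sub extensions are left out because the media type Tika reports for them should be checked against the Tika version in use.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a detected media type is on the allow-list.
final class AllowedTypes {

    // Assumed allow-list covering html, doc, docx, odt, txt, rtf, pdf, odf, odp, xls, ppt, msg.
    // Verify each string against what your Tika version actually reports.
    static final Set<String> ALLOWED = Set.of(
            "text/html",
            "application/msword",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "application/vnd.oasis.opendocument.text",
            "text/plain",
            "application/rtf",
            "application/pdf",
            "application/vnd.oasis.opendocument.formula",
            "application/vnd.oasis.opendocument.presentation",
            "application/vnd.ms-excel",
            "application/vnd.ms-powerpoint",
            "application/vnd.ms-outlook");

    // `detected` is the string returned by the detection step,
    // e.g. String detected = new Tika().detect(file);
    static boolean isAllowed(String detected) {
        if (detected == null) {
            return false;
        }
        // Drop any parameters such as "; charset=UTF-8" before comparing.
        int semi = detected.indexOf(';');
        String base = (semi >= 0 ? detected.substring(0, semi) : detected)
                .trim()
                .toLowerCase(Locale.ROOT);
        return ALLOWED.contains(base);
    }
}
```

A caller would reject the upload when `isAllowed` returns false and only then hand accepted files to the AutoDetectParser, so an MP3 is refused before any parsing happens.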

However, if you want an extra check, there are two ways to do it. One is to have a custom org.apache.tika.parser.Parser file that lists only the parsers for the formats you want to handle. This is the config file used to decide which parsers are made available to the AutoDetectParser, so if, for example, you removed the MP3Parser from that list, the auto-detect parser would stop handling MP3.
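As a rough illustration, such a file lives at `META-INF/services/org.apache.tika.parser.Parser` and contains one fully qualified parser class name per line. The class names below are the ones I know from the Tika 1.x parser bundle, and the selection is my guess at what covers the formats in the question, so check both against your Tika version. The sketch also assumes this is the only copy of the file on the classpath; if the copy shipped inside the parsers jar is still visible, its entries may be picked up as well.

```text
# META-INF/services/org.apache.tika.parser.Parser
# Only the parsers listed here are offered to AutoDetectParser (assumed selection).
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.microsoft.OfficeParser
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
org.apache.tika.parser.odf.OpenDocumentParser
org.apache.tika.parser.pdf.PDFParser
org.apache.tika.parser.rtf.RTFParser
org.apache.tika.parser.txt.TXTParser
```

With the MP3 parser absent from the list, the auto-detect parser has nothing registered for MP3 even though detection still recognises the format.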

The other way is to keep an explicit list of the parsers you wish to support. Then, rather than using the auto-detect parser, simply iterate through them until you reach one that is able to handle the file, and call its parse method directly. This gives you the most control, at the cost of slightly more work.
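The loop over an explicit parser list might look like the following. To keep the sketch free of dependencies it uses a hypothetical `DocParser` interface as a stand-in for `org.apache.tika.parser.Parser`; with real Tika parsers the two calls would be `getSupportedTypes(ParseContext)` and `parse(InputStream, ContentHandler, Metadata, ParseContext)`. The class and method names here are illustrative assumptions, not part of Tika.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of "iterate an explicit parser list and call parse directly".
final class ExplicitParsers {

    // Stand-in for org.apache.tika.parser.Parser, reduced to what the loop needs.
    interface DocParser {
        // Tika equivalent: getSupportedTypes(ParseContext)
        Set<String> supportedTypes();

        // Tika equivalent: parse(InputStream, ContentHandler, Metadata, ParseContext)
        String parse(byte[] content);
    }

    // Returns the text from the first parser that supports the media type,
    // or empty if no parser in the explicit list handles it.
    static Optional<String> parseWithFirstMatch(
            List<DocParser> parsers, String mediaType, byte[] content) {
        for (DocParser parser : parsers) {
            if (parser.supportedTypes().contains(mediaType)) {
                return Optional.of(parser.parse(content));
            }
        }
        return Optional.empty();
    }
}
```

An empty result is the signal to reject the file, which keeps the decision about unsupported types in your own code rather than inside the auto-detect parser.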
