Getting MimeType subtype with Apache Tika

Question

For documents such as odt, ppt, pptx, and xlsx, I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice.

If you look at mimetypes.xml, there are mime-type elements made up of the iana.org MIME type and a "sub-class-of" element that names the parent type:

<mime-type type="application/msword">
  <alias type="application/vnd.ms-word"/>
  ............................
  <glob pattern="*.doc"/>
  <glob pattern="*.dot"/>
  <sub-class-of type="application/x-tika-msoffice"/>
</mime-type>
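
To make the relationship concrete, here is a minimal sketch in plain Java (no Tika classes) that models "sub-class-of" as a child-to-parent map. Only the application/msword entry comes from the snippet above; the Excel and PowerPoint entries, the class name, and the helper method are assumptions added for illustration. It shows why the parent type alone cannot be turned back into a specific type: several children share the same parent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubClassOfSketch {

    // child type -> parent type, mirroring <sub-class-of> entries
    static final Map<String, String> PARENT = new LinkedHashMap<>();
    static {
        // From the mimetypes.xml snippet above
        PARENT.put("application/msword", "application/x-tika-msoffice");
        // Assumed entries, added only to illustrate shared parents
        PARENT.put("application/vnd.ms-excel", "application/x-tika-msoffice");
        PARENT.put("application/vnd.ms-powerpoint", "application/x-tika-msoffice");
    }

    // Collects every type that declares the given type as its parent
    static List<String> childrenOf(String parent) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : PARENT.entrySet()) {
            if (e.getValue().equals(parent)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Going up the hierarchy is unambiguous
        System.out.println(PARENT.get("application/msword"));
        // Going down is not: one parent, several candidate children
        System.out.println(childrenOf("application/x-tika-msoffice"));
    }
}
```

Because the downward lookup is ambiguous, the specific type has to come from inspecting the file itself, not from the hierarchy.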

How can I get the iana.org MIME type name instead of the parent type name?

When testing MIME type detection, I do this:

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

Test results:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>
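
One detail in these results is separate from the detection problem: some expected values are full media types (application/msword) while the actual values are subtypes only (x-tika-msoffice), because getSubtype() returns just the part after the slash. The sketch below uses a hypothetical plain-string helper as a stand-in for that behaviour, so it runs without Tika; the class and method names are illustrative assumptions.

```java
public class SubtypeVsFullType {

    // Stand-in for MediaType.getSubtype(): the part after the slash
    static String subtypeOf(String mediaType) {
        int slash = mediaType.indexOf('/');
        return slash < 0 ? mediaType : mediaType.substring(slash + 1);
    }

    public static void main(String[] args) {
        // Pretend detection already returned the specific type
        String detected = "application/vnd.ms-excel";

        // Comparing a subtype against a full type can never match
        System.out.println(subtypeOf(detected).equals("application/vnd.ms-excel")); // false

        // Compare like with like instead
        System.out.println(subtypeOf(detected).equals("vnd.ms-excel")); // true
        System.out.println(detected.equals("application/vnd.ms-excel")); // true
    }
}
```

So even once detection returns the specific type, the assertions that expect a value starting with application/ would still fail unless both sides use the same form.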

Is there any way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?

Moreover, I never get application/x-tika-ooxml for xlsx, docx, and pptx documents, only application/zip.

Recommended answer

The default byte-pattern detection rules in tika-core can only detect the generic OLE2 or ZIP container format that all MS Office document types share. As far as I know, you want to use ContainerAwareDetector for this kind of detection, with the MimeTypes detector as its fallback. Try this:

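public MediaType getContentType(InputStream is, String fileName) {
    Metadata md = new Metadata();
    // The file name gives the fallback detector a hint for glob matching
    md.set(Metadata.RESOURCE_NAME_KEY, fileName);
    // The MimeTypes repository from tikaConfig acts as the fallback detector
    Detector detector = new ContainerAwareDetector(tikaConfig.getMimeRepository());

    try {
        return detector.detect(is, md);
    } catch (IOException ioe) {
        // Placeholder error handling: report the generic binary type
        return MediaType.OCTET_STREAM;
    }
}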

With this, your tests should pass.
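
The idea of looking inside the container can be illustrated without Tika. The following is a minimal sketch using only java.util.zip, not Tika's implementation: the magic-byte check sees only "a ZIP", and a second step inspects entry names to pick a more specific type. The class name, the helper methods, and the rule that a top-level word/, xl/, or ppt/ folder identifies the format are simplifying assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ContainerSniffSketch {

    // Magic bytes only: every OOXML file looks like a plain ZIP here
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    // Simplified container-aware step: decide by entry names (assumed rules)
    static String detect(byte[] data) {
        if (!looksLikeZip(data)) {
            return "application/octet-stream";
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/")) {
                    return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                }
                if (name.startsWith("xl/")) {
                    return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                }
                if (name.startsWith("ppt/")) {
                    return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // No known entry found: fall back to the generic container type
        return "application/zip";
    }

    // Builds a tiny in-memory ZIP with one entry, for demonstration only
    static byte[] zipWithEntry(String entryName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(detect(zipWithEntry("xl/workbook.xml")));
        System.out.println(detect(zipWithEntry("readme.txt")));
    }
}
```

The first call prints the spreadsheet type and the second prints application/zip, which mirrors the difference between container-aware detection and the byte-pattern-only result seen in the question.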
