使用 Apache tika 获取 MimeType 子类型 [英] Getting MimeType subtype with Apache tika

查看:68
本文介绍了使用 Apache tika 获取 MimeType 子类型的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于 odt、ppt、pptx、xlsx 等文档,我需要获取 iana.org MediaType 而不是 application/zip 或 application/x-tika-msoffice.

I'd need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like, odt, ppt, pptx, xlsx etc.

如果您查看 mimetypes.xml,则有 mimeType 元素由 iana.org mime-type 和sub-class-of"组成

If you look at mimetypes.xml there are mimeType elements composed of the iana.org mime-type and "sub-class-of"

   <mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    ............................
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
  </mime-type>

如何获取 iana.org mime-type 名称而不是父类型名称?

How to get the iana.org mime-type name instead of the parent type name ?

在测试 mime 类型检测时,我这样做:

When testing mime type detection, I do :

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

测试结果:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>

有没有办法从 mimetypes.xml 获取实际的子类型?而不是 x-tika-msoffice 或 application/zip ?

Is there any way to get the actual subtype from mimetypes.xml ? Instead of x-tika-msoffice or application/zip ?

此外,我从来没有得到 application/x-tika-ooxml,而是得到了 xlsx、docx、pptx 文档的 application/zip.

Moreover I never get application/x-tika-ooxml, but application/zip for xlsx, docx, pptx documents.

推荐答案

tika-core 中的默认字节模式检测规则只能检测所有 MS Office 文档类型使用的通用 OLE2 或 ZIP 格式.您想使用 ContainerAwareDetector 进行这种检测 afaik.并使用 MimeTypes 检测器作为它的回退检测器.试试这个:

The default byte pattern detection rules in tika-core can only detect the generic OLE2 or ZIP format used by all MS Office document types. You want to use ContainerAwareDetector for this kind of detection afaik. And use MimeTypes detector as its fallback detector. Try this :

public MediaType getContentType(InputStream is, String fileName) {
    MediaType mediaType;
    Metadata md = new Metadata();
    md.set(Metadata.RESOURCE_NAME_KEY, fileName);
    Detector detector = new ContainerAwareDetector(tikaConfig.getMimeRepository());

    try {
        mediaType = detector.detect(is, md);
    } catch (IOException ioe) {
        whatever;
    }
    return mediaType;
}

这样你的测试应该通过

这篇关于使用 Apache tika 获取 MimeType 子类型的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆