Mimetype check using Tika jars


Question

I am developing a standalone Java batch process. I am trying to determine the mimetype of file attachments using the Tika jars. I am using the Tika 1.4 jar files.

My code looks like this:

// Parse the attachment with auto-detection, then read the detected type from the metadata
Parser parser = new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler = -1; // -1 disables the BodyContentHandler write limit
ContentHandler contentHandler = new BodyContentHandler(writerHandler);
Metadata metadata = new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: " + fileAttachment.getName() + " MimeType is: " + mimeType);

This code does not work properly for Office 2003 and Office 2007 documents.

When I run it from Eclipse, I get the correct mimetypes.

When I build the jar file and run it from the command line, it gives the wrong mimetypes.

Output from the command line
------------
File Attachment: Testpdf.pdf  MimeType is: application/pdf
File Attachment: Testpdf.tif  MimeType is: image/tiff
File Attachment: Testpdf.xlsx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xltx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.pptx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.docx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xls  MimeType is: application/zip
File Attachment: Testpdf.doc  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.dot  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.ppt  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.xlt  MimeType is: application/vnd.ms-excel

I tried OfficeParser and OOXMLParser, but that did not work. I have also tried the Tika 0.9 jar files. With those, the mimetypes come out correctly, but if any one of my file attachments is an "editable PDF", my batch process dies (as if the code had called "exit(0);"). With the newer Tika jars, I get the wrong mimetypes.

Please help me with this. Thanks in advance.

CVSR Sarma

Recommended answer

Firstly, you're using the wrong bit of Apache Tika. If all you want to know is the file type, then you should use the Detection API (see the Javadocs) directly, e.g.:

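// Load the default Tika configuration, which supplies the default detector
TikaConfig tika = new TikaConfig();

// Pass the file name as a hint alongside the stream contents
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
// detect() returns a MediaType, so convert it to a String
String mimetype = tika.getDetector().detect(stream, metadata).toString();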

If you have only the tika-core jar on your classpath, then the detection above will use Mime Magic and filename hints. That will let it identify most files, especially if they have the right extension, but it will struggle with wrongly named "container formats".

Container formats are things like zip and ole2, where one underlying file format can hold many types (e.g. ods, xlsx and keynote all use zip, while .doc and .xls both use ole2). If you want detection that looks inside containers for more accurate results, you also need to include the tika-parsers jar and its dependencies.
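To illustrate why magic bytes alone cannot tell container-based formats apart, here is a minimal sketch using only the Java standard library. It is not how Tika implements detection; the class name ContainerSniffer and the sample header bytes are hypothetical and exist only for this illustration. The ZIP and OLE2 signatures themselves are the standard ones. Because every zip-based Office file starts with the same bytes, a magic-only check can report no more than the generic container, which appears consistent with the generic application/x-tika-ooxml and application/x-tika-msoffice results in the question.

```java
// Illustration only: this is NOT Tika's implementation.
final class ContainerSniffer {
    // ZIP local file header signature "PK\3\4" (docx, xlsx, pptx, ods, ... all start with it).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound document signature (doc, xls, ppt, ... all start with it).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns "zip", "ole2", or "unknown" based only on the leading bytes. */
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) return "zip";
        if (startsWith(header, OLE2_MAGIC)) return "ole2";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical leading bytes; a real .docx and a real .xlsx both begin with the ZIP signature.
        byte[] docxStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] xlsxStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(docxStart)); // zip
        System.out.println(sniff(xlsxStart)); // zip: indistinguishable from the docx by magic alone
    }
}
```

Telling a .docx from an .xlsx requires opening the container and inspecting its entries, which is the part that needs the parser jars.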

Note that, as explained in the Javadocs, your stream needs to support mark and reset for detection to work. This lets Tika read the first part of your stream, look at it to work out what your file is, and then return the stream to its original position, ready for other uses (e.g. parsing). Most streams should support this, but if yours doesn't, the simplest fix is to wrap it in a TikaInputStream via TikaInputStream.get, which sorts all that out for you.
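The mark and reset behaviour described above can be sketched with the standard library alone. This is only an illustration of the mechanism, not Tika code; the class MarkResetDemo and its peek method are hypothetical names. Note that a plain java.io.FileInputStream reports markSupported() as false, so it needs a wrapper such as BufferedInputStream (or, with Tika, TikaInputStream.get as the answer suggests).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Illustration only: shows the mark/reset mechanism a detector relies on.
final class MarkResetDemo {
    /** Reads up to n leading bytes, then rewinds the stream to where it started. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(n); // remember the current position for up to n bytes of read-ahead
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) break; // end of stream reached before n bytes
            total += read;
        }
        in.reset(); // rewind so later consumers see the stream from the start
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x50, 0x4B, 0x03, 0x04, 1, 2, 3};
        // BufferedInputStream adds mark/reset support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, 4);
        System.out.println(head.length); // 4
        System.out.println(in.read());   // 80 (0x50): the stream is back at its first byte
    }
}
```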
