Mimetype check using Tika jars

Question

I am developing a standalone Java batch process. I am trying to determine the MIME type of file attachments using the Tika jars. I am using the Tika 1.4 jar files.

My code looks like this:

Parser parser= new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler =-1;
ContentHandler contentHandler= new BodyContentHandler(writerHandler);
Metadata metadata= new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: " + fileAttachment.getName() + " MimeType is: " + mimeType);

This code does not work properly for Office 2003 and 2007 documents.

When I run it from Eclipse, I get the correct MIME types. When I build the jar file and run it from the command line, it gives the wrong MIME types.

Output from the command line
------------
File Attachment: Testpdf.pdf  MimeType is: application/pdf
File Attachment: Testpdf.tif  MimeType is: image/tiff
File Attachment: Testpdf.xlsx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xltx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.pptx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.docx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xls  MimeType is: application/zip
File Attachment: Testpdf.doc  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.dot  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.ppt  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.xlt  MimeType is: application/vnd.ms-excel

I tried OfficeParser and OOXMLParser, but that did not work. I also tried the Tika 0.9 jar files. With those, the MIME types come out correctly, but if any one of my file attachments is an "editable PDF", my batch process dies (as if the code had called "exit(0);"). With the newer Tika jars, it gives the wrong MIME types.

Please help me with this. Thanks in advance.

CVSR Sarma

Recommended answer

Firstly, you're using the wrong bit of Apache Tika. If all you want to know is the file type, then you should use the Detection API (see the Javadocs) directly, e.g.:

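import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;

TikaConfig tika = new TikaConfig();

Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename); // the file name is a hint for the detector
// detect() returns a MediaType, so convert it to get the MIME type string
String mimetype = tika.getDetector().detect(stream, metadata).toString();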

If you have only the tika-core jar on your classpath, then the detection above will use MIME magic and filename hints. That'll let it get most files right, especially if they have the right extension, but it'll struggle with wrongly named "container formats".

Container formats are things like zip and OLE2, where one file format can hold many types (e.g. ods, xlsx and Keynote all use zip, while .doc and .xls both use OLE2). If you want detection that looks inside containers for more accurate results, you also need to include the tika-parsers jar and its dependencies.
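
To illustrate why magic bytes alone cannot tell container formats apart, here is a minimal sketch in plain Java. It is not Tika's implementation: the class name MagicSniffer and the method sniff are hypothetical, and the returned labels are only borrowed from the output in the question. It checks the leading bytes against the well-known ZIP and OLE2 signatures, which is as far as a signature check can go without opening the container.

```java
import java.util.Arrays;

// Illustrative sketch only: NOT Tika's implementation. All names are hypothetical.
final class MagicSniffer {
    // Every ZIP-based file (docx, xlsx, pptx, ods, ...) starts with these bytes.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // Every OLE2 compound file (doc, xls, ppt, ...) starts with these bytes.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns the most specific type that the leading bytes alone can justify.
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP)) {
            return "application/zip"; // could be docx, xlsx, ods, or a plain zip
        }
        if (startsWith(header, OLE2)) {
            return "application/x-tika-msoffice"; // could be doc, xls, ppt, ...
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A .docx and an .xlsx file both produce the ZIP result here, so without a trustworthy file name the only way to distinguish them is to look inside the container.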

Note that, as explained in the Javadocs, your stream needs to support mark and reset for detection to work. This is so that Tika can read the first bit of your stream, work out what your file is, and then return the stream to how it was, ready for other uses (e.g. parsing). Most streams should, but if yours doesn't, the simplest fix is to wrap it in a TikaInputStream via TikaInputStream.get (http://tika.apache.org/1.10/api/org/apache/tika/io/TikaInputStream.html#get%28java.io.InputStream%29), which sorts all that out for you.
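
The mark and reset requirement can be sketched with the JDK alone. The example below is only an illustration of the idea, not Tika code: the class name PeekDemo and the method peek are hypothetical. It marks the stream, reads a few leading bytes the way a detector would, and then resets so that a later consumer still sees the stream from the beginning.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Illustrative sketch only: shows the mark/reset pattern, not Tika's own code.
final class PeekDemo {
    // Reads up to n leading bytes, then rewinds the stream to where it started.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n); // remember the current position for up to n bytes
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break; // end of stream reached before n bytes
            }
            total += read;
        }
        in.reset(); // rewind so the next reader starts at the first byte
        return Arrays.copyOf(buf, total);
    }
}
```

A plain java.io.FileInputStream, as used in the question, reports markSupported() as false, so it would be rejected here. Wrapping it in a java.io.BufferedInputStream, or in a TikaInputStream as the answer suggests, gives it mark and reset support.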
