How to accurately determine MIME data from a file?


Question

I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

import javax.activation.FileDataSource;

FileDataSource ds = new FileDataSource("~\\Downloads\\777135_new.xls");  
String contentType = ds.getContentType();  
System.out.println("The MIME type of the file is: " + contentType);

//output = The MIME type of the file is: application/octet-stream

Method 2:

import java.io.RandomAccessFile;
import net.sf.jmimemagic.*;

// try-with-resources closes the file even if detection throws
try (RandomAccessFile f = new RandomAccessFile("~\\Downloads\\777135_new.xls", "r"))
{
    byte[] fileBytes = new byte[(int)f.length()];
    f.readFully(fileBytes);  // read() alone may return before the buffer is full
    MagicMatch match = Magic.getMagicMatch(fileBytes);
    System.out.println("The Mime type is: " + match.getMimeType());
}
catch(Exception e)
{
    System.out.println(e);
}

//output = The Mime type is: application/msword

Method 3:

import java.io.File;
import java.util.Collection;
import eu.medsea.mimeutil.*;

MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
File f = new File ("~\\Downloads\\777135_new.xls");
Collection<?> mimeTypes = MimeUtil.getMimeTypes(f);
String mimeType = MimeUtil.getFirstMimeType(mimeTypes.toString()).toString();
String subMimeType = MimeUtil.getSubType(mimeTypes.toString());
System.out.println("The Mime type is: " + mimeTypes + ", " + mimeType + ", " + subMimeType);

//output = The Mime type is: application/msword, application/msword, msword

I found these three methods at http://www.rgagnon.com/javadetails/java-0487.html. However, my problem is that the file I am testing these methods on is one I created, so I know it's an Excel file, yet methods 2 and 3 incorrectly report the type as application/msword. Method 1 returns application/octet-stream instead, which I believe is because of the limited number of file types in the built-in FileTypeMap that it uses.

I've had a look around, and some people say the content type is picked up incorrectly because of the way the offset is detected in the files, as pointed out in this wiki on detecting file types in PHP. Unfortunately, the wiki then goes on to use the extension to determine the file type, which isn't what I want to do because it's unreliable.

Can anyone point me in the right direction to a method that will detect file types correctly in Java, please?

Cheers, Alexei Blue.

It looks like there is no specific solution to this, as @IronMensan said in the comment below. I did find this really interesting research paper that applies machine learning in a few ways to help with the issue, but there doesn't seem to be a foolproof answer. I think my best bet here will be to try passing the file to an Excel file reader and catch any incorrect-format exceptions.

Recommended answer

As mentioned in the comments, since there are so many possible file types, detection could be hit and miss across ALL possible files, but you probably know the types of files you are typically going to be dealing with. This excellent list of magic numbers has helped me do detection recently around the specific Office formats you mentioned (search for Microsoft Office), and you'll see that the MS Office file types have a sub-type specified (which is further into the file) that lets you work out specifically which type of file you have. Many newer formats like ODT and DOCX (OOXML) use a ZIP file to hold their data, so you might need to detect ZIP first, then look for specifics.
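
As a rough illustration of that two-step idea, here is a minimal sketch using only the JDK. The class name OfficeTypeSniffer and its methods are hypothetical and not part of any library mentioned in the question. It checks the leading magic number first (the OLE2 compound-file signature used by legacy .doc/.xls/.ppt files, or the ZIP local-header signature), then looks further into the file for a sub-type. The OLE2 step is only a heuristic: it scans the raw bytes for UTF-16LE stream names rather than parsing the directory structure, so a Word document with an embedded spreadsheet could be misreported.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration; not an API of any library named above. */
class OfficeTypeSniffer {
    // OLE2 compound file signature (legacy .doc/.xls/.ppt containers)
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature "PK\3\4"
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Returns a best-guess MIME type for the given file contents. */
    static String sniff(byte[] data) {
        if (startsWith(data, OLE2)) return sniffOle2(data);
        if (startsWith(data, ZIP)) return sniffZip(data);
        return "application/octet-stream";
    }

    // Heuristic (assumption): stream names are stored as UTF-16LE in the OLE2
    // directory, so a raw byte search usually finds the main stream's name.
    private static String sniffOle2(byte[] data) {
        if (contains(data, utf16("Workbook"))) return "application/vnd.ms-excel";
        if (contains(data, utf16("WordDocument"))) return "application/msword";
        if (contains(data, utf16("PowerPoint Document"))) return "application/vnd.ms-powerpoint";
        return "application/x-ole-storage"; // non-standard placeholder label
    }

    // Walks the ZIP entries looking for a part that identifies the format.
    private static String sniffZip(byte[] data) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                switch (e.getName()) {
                    case "xl/workbook.xml":
                        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                    case "word/document.xml":
                        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                    case "ppt/presentation.xml":
                        return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                    case "mimetype":
                        // OpenDocument files declare their own type in this entry
                        return new String(zin.readAllBytes(), StandardCharsets.US_ASCII).trim();
                    default:
                        break;
                }
            }
        } catch (IOException ex) {
            // Corrupt archive: fall through and report a plain ZIP
        }
        return "application/zip";
    }

    private static byte[] utf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Naive substring search over bytes; fine for a sketch, slow for huge files.
    private static boolean contains(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) continue outer;
            }
            return true;
        }
        return false;
    }
}
```

To try it on a real file, the bytes could be loaded with java.nio.file.Files.readAllBytes and passed to sniff. For anything beyond a quick check, parsing the OLE2 directory properly would be more reliable than the byte scan used here.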
