How to detect if a file is PDF or TIFF?

Question

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.

Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.

The problem is I need to serve up files on an ASP.NET (2.0) page and display the TIFF files as TIFF and the PDF files as PDF. Unfortunately I don't know which is which, and I need to be able to display them appropriately in their respective formats.

For example, let's say that there are two files I need to display: one is TIFF and one is PDF. The page should show the TIFF image, and perhaps a link that opens the PDF in a new tab/window.

The issue:

As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.

http://support.microsoft.com/kb/326965

Is this problem easier than I think or is it as nasty as I am expecting?

Recommended answer

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;


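private bool IsTiff(Stream stm)
{
    // Rewind so the check always starts at the first byte.
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;
    byte[] header = new byte[kHeaderSize];

    stm.Read(header, 0, header.Length);

    // The byte-order mark is "II" (Intel, little-endian) or "MM" (Motorola, big-endian).
    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;

    // The next two bytes hold the magic number 42 in the declared byte order.
    ushort magicNumber = ReadShort(stm, isIntel);
    if (magicNumber != kTiffMagicNumber)
        return false;
    return true;
}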

private ushort ReadShort(Stream stm, bool isIntel)
{
    byte[] b = new byte[2];
    // Use the parameters passed in, not instance fields left over from the original class.
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}

I hacked apart some much more general code to get this.
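
As a quick sanity check of the header logic above, here is a minimal self-contained sketch that applies the same test to an in-memory byte array instead of a Stream. The class name TiffHeaderCheck and the method LooksLikeTiff are made up for illustration and are not part of the original code.

```csharp
using System;

// Hypothetical helper, not from the original answer: applies the same
// header test as IsTiff, but to a byte array already held in memory.
public static class TiffHeaderCheck
{
    public static bool LooksLikeTiff(byte[] data)
    {
        // A TIFF header is 8 bytes: 2-byte order mark, 2-byte magic number,
        // 4-byte offset of the first image file directory.
        if (data == null || data.Length < 8)
            return false;

        // "II" (0x49 0x49) means little-endian, "MM" (0x4D 0x4D) means big-endian.
        bool isIntel = data[0] == 0x49 && data[1] == 0x49;
        bool isMotorola = data[0] == 0x4D && data[1] == 0x4D;
        if (!isIntel && !isMotorola)
            return false;

        // Assemble the magic number in the byte order the file declared.
        int magic = isIntel
            ? (data[3] << 8) | data[2]
            : (data[2] << 8) | data[3];
        return magic == 42;
    }
}
```

A little-endian file begins with the bytes 49 49 2A 00 and a big-endian file with 4D 4D 00 2A; both pass this check, while a file beginning with "%PDF-1.4" does not.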

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null) 
    {
        if (token.TokenType == MLPdfTokenType.Comment) 
        {
            if (token.Text.StartsWith("%PDF-1.")) 
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of doing a substring search to avoid a problem like this:

% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

This file is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.
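
To make the difference concrete without the full tokenizer, here is a rough sketch that contrasts a plain substring search with a stricter check that only accepts "%PDF-1." at the start of a line. This is only an approximation of what a real PDF tokenizer does (for instance, it will miss a header that follows junk bytes on the same line), and the names PdfHeaderCheck, NaiveIsPdf and LineAnchoredIsPdf are invented for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper: an approximation of the tokenizer-based check,
// NOT the GetToken() scanner described in the answer.
public static class PdfHeaderCheck
{
    private const string Marker = "%PDF-1.";
    private const int SearchLimit = 1024;

    // Decode at most the first 1024 bytes. Non-ASCII bytes become '?',
    // which does not affect the search for the ASCII marker.
    private static string Head(byte[] data)
    {
        int count = Math.Min(data.Length, SearchLimit);
        return Encoding.ASCII.GetString(data, 0, count);
    }

    // Naive: true if the marker appears anywhere in the searched region.
    public static bool NaiveIsPdf(byte[] data)
    {
        return Head(data).IndexOf(Marker, StringComparison.Ordinal) >= 0;
    }

    // Stricter: the marker must be the first thing on some line, so a
    // mention of "%PDF-1.3" in the middle of a comment does not count.
    public static bool LineAnchoredIsPdf(byte[] data)
    {
        string[] lines = Head(data).Split('\r', '\n');
        foreach (string line in lines)
        {
            if (line.StartsWith(Marker, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

Run against the PostScript sample above, NaiveIsPdf reports a PDF because "%PDF-1.3" occurs inside the second comment line, while LineAnchoredIsPdf does not.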

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly, from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
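
Tying this back to the original question, one possible way to pick a MIME type is to read the first 1024 bytes once and run both checks on that buffer. The sketch below is an assumption-laden illustration: the class ContentTypeSniffer, the method Detect, and the "application/octet-stream" fallback are all invented here, and the PDF check is a line-anchored approximation rather than a real tokenizer. It reads from the stream's current position, so the caller is expected to have positioned the stream at the start of the file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: names and the fallback MIME type are assumptions.
public static class ContentTypeSniffer
{
    private const int SniffLength = 1024;

    public static string Detect(Stream stm)
    {
        // Stream.Read may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends.
        byte[] buf = new byte[SniffLength];
        int total = 0;
        while (total < buf.Length)
        {
            int n = stm.Read(buf, total, buf.Length - total);
            if (n <= 0)
                break;
            total += n;
        }

        if (IsTiffHeader(buf, total))
            return "image/tiff";
        if (HasPdfHeader(buf, total))
            return "application/pdf";
        return "application/octet-stream";
    }

    // "II" + 42 little-endian, or "MM" + 42 big-endian, in an 8-byte header.
    private static bool IsTiffHeader(byte[] b, int len)
    {
        if (len < 8)
            return false;
        if (b[0] == 0x49 && b[1] == 0x49)
            return b[2] == 42 && b[3] == 0;
        if (b[0] == 0x4D && b[1] == 0x4D)
            return b[2] == 0 && b[3] == 42;
        return false;
    }

    // Approximation: accept "%PDF-1." only at the start of a line
    // within the bytes that were read.
    private static bool HasPdfHeader(byte[] b, int len)
    {
        string head = Encoding.ASCII.GetString(b, 0, len);
        foreach (string line in head.Split('\r', '\n'))
        {
            if (line.StartsWith("%PDF-1.", StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

In an ASP.NET page or handler, the returned string could then be assigned to Response.ContentType before the file is written to the response, instead of forcing a single MIME type in IIS.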
