How to detect if a file is PDF or TIFF?

Question

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.

Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.

The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.

For example, let's say that there are 2 files I need to display, one is tiff and one is PDF. The page should show up with a tiff image, and perhaps a link that would open up the PDF in a new tab/window.

The problem:

As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.

http://support.microsoft.com/kb/326965

Is this problem easier than I think or is it as nasty as I am expecting?

Recommended answer

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;


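private bool IsTiff(Stream stm)
{
    // Stream.Seek takes an offset and an origin.
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;
    byte[] header = new byte[kHeaderSize];

    if (stm.Read(header, 0, header.Length) != header.Length)
        return false;

    // Byte-order mark: "II" (Intel, little-endian) or "MM" (Motorola, big-endian).
    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;

    // The next two bytes hold the number 42, stored in the byte order just detected.
    ushort magicNumber = ReadShort(stm, isIntel);
    return magicNumber == kTiffMagicNumber;
}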

private ushort ReadShort(Stream stm, bool isIntel)
{
    // Read two bytes and combine them in the file's byte order.
    byte[] b = new byte[2];
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null) 
    {
        if (token.TokenType == MLPdfTokenType.Comment) 
        {
            if (token.Text.StartsWith("%PDF-1.")) 
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking for a substring to avoid a problem like this:

% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

This file is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
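
Since GetToken() isn't shown, here is a rough, self-contained sketch of how the two checks might be combined to pick a MIME type for an extension-less file. It is not the tokenizer-based code above: as a stand-in, it only accepts "%PDF-1." when it appears at the start of a line within the first 1024 bytes, which rejects the PostScript example but is still only an approximation of real PDF tokenizing. The class name FileSniffer and all of its method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; the names are illustrative and not from the answer above.
public static class FileSniffer
{
    // Acrobat only looks for the header in the first 1024 bytes.
    private const int kPdfSearchLimit = 1024;

    // Reads up to 'limit' bytes from the start of a seekable stream.
    public static byte[] ReadHead(Stream stm, int limit)
    {
        stm.Seek(0, SeekOrigin.Begin);
        byte[] buf = new byte[limit];
        int total = 0;
        int n;
        while (total < limit && (n = stm.Read(buf, total, limit - total)) > 0)
            total += n;
        byte[] head = new byte[total];
        Array.Copy(buf, head, total);
        return head;
    }

    // TIFF: "II" or "MM", followed by the number 42 in that byte order.
    public static bool LooksLikeTiff(byte[] head)
    {
        if (head.Length < 8) return false;
        if (head[0] == 0x49 && head[1] == 0x49) return head[2] == 42 && head[3] == 0;
        if (head[0] == 0x4d && head[1] == 0x4d) return head[2] == 0 && head[3] == 42;
        return false;
    }

    // PDF (approximation, NOT a tokenizer): "%PDF-1." at the start of a line.
    public static bool LooksLikePdf(byte[] head)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%PDF-1.");
        for (int i = 0; i + marker.Length <= head.Length; i++)
        {
            bool atLineStart = i == 0
                || head[i - 1] == (byte)'\n'
                || head[i - 1] == (byte)'\r';
            if (!atLineStart) continue;

            int j = 0;
            while (j < marker.Length && head[i + j] == marker[j]) j++;
            if (j == marker.Length) return true;
        }
        return false;
    }

    // Returns "image/tiff", "application/pdf", or null when neither check matches.
    public static string GuessContentType(Stream stm)
    {
        byte[] head = ReadHead(stm, kPdfSearchLimit);
        if (LooksLikeTiff(head)) return "image/tiff";
        if (LooksLikePdf(head)) return "application/pdf";
        return null;
    }
}
```

In the page, the result could then be assigned to Response.ContentType before the file is written out, with a null result treated as an unknown file rather than forced to one type or the other.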
