How to detect if a file is PDF or TIFF?
Question
Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.
Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.
The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.
For example, let's say there are two files I need to display: one is a TIFF and one is a PDF. The page should show the TIFF as an image, and perhaps a link that opens the PDF in a new tab/window.
The problem:
As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.
http://support.microsoft.com/kb/326965
Is this problem easier than I think or is it as nasty as I am expecting?
Recommended answer
OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:
// Requires: using System.IO;
private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;     // 'I': little-endian byte order
private const byte kMotorolaMark = 0x4d;  // 'M': big-endian byte order
private const ushort kTiffMagicNumber = 42;

private bool IsTiff(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;

    // The first two bytes must be identical: "II" or "MM".
    byte[] header = new byte[kHeaderSize];
    stm.Read(header, 0, header.Length);
    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;

    // The next two bytes hold the magic number 42 in the declared byte order.
    bool isIntel = header[0] == kIntelMark;
    ushort magicNumber = ReadShort(stm, isIntel);
    return magicNumber == kTiffMagicNumber;
}

private ushort ReadShort(Stream stm, bool isIntel)
{
    byte[] b = new byte[2];
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}
I hacked apart some much more general code to get this.
For PDF, I have code that looks like this:
public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null)
    {
        if (token.TokenType == MLPdfTokenType.Comment)
        {
            if (token.Text.StartsWith("%PDF-1."))
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}
Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of searching for a substring to avoid a problem like this:
% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage
The file above is identified as NOT a PDF by the IsPdf() snippet, whereas a more simplistic substring search would incorrectly mark it as a PDF.
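To make the difference concrete, here is a minimal sketch that contrasts a plain substring search with a check that only accepts the marker at the start of a line. The class and method names (PdfHeaderDemo, NaiveIsPdf, LineAnchoredIsPdf) are hypothetical, and the line-anchored check is only a rough stand-in for a real PDF tokenizer, not the scanner described above.

```csharp
using System;

public static class PdfHeaderDemo
{
    // Naive check: true if the marker appears anywhere in the text.
    public static bool NaiveIsPdf(string text)
    {
        return text.IndexOf("%PDF-1.", StringComparison.Ordinal) >= 0;
    }

    // Line-anchored check (an assumption, not the tokenizer from the answer):
    // true only if some line begins with the marker.
    public static bool LineAnchoredIsPdf(string text)
    {
        string[] lines = text.Split(new char[] { '\r', '\n' });
        foreach (string line in lines)
        {
            if (line.StartsWith("%PDF-1.", StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

Fed the PostScript sample shown earlier, NaiveIsPdf returns true (a false positive) because "%PDF-1.3" occurs in the middle of a comment line, while LineAnchoredIsPdf returns false. A file whose first line is "%PDF-1.6" passes both checks.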
I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
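Putting the pieces together for the original question, the following is a rough sketch of how the two checks could drive the Content-Type sent to the browser. Everything here is an assumption for illustration: the FileSniffer class and its method names are hypothetical, the PDF check is a line-anchored byte search over the first 1024 bytes rather than the tokenizer described above, and "application/octet-stream" is just one possible fallback.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileSniffer
{
    // Window size taken from the Acrobat implementation note quoted above.
    private const int PdfHeaderWindow = 1024;

    // Reads up to 'count' bytes; Stream.Read may return fewer than requested.
    private static byte[] ReadHead(Stream stm, int count)
    {
        byte[] buffer = new byte[count];
        int total = 0;
        int n;
        while (total < count && (n = stm.Read(buffer, total, count - total)) > 0)
            total += n;
        byte[] head = new byte[total];
        Array.Copy(buffer, head, total);
        return head;
    }

    // "II" + 42 little-endian, or "MM" + 42 big-endian; at least 8 bytes long.
    private static bool LooksLikeTiff(byte[] head)
    {
        if (head.Length < 8) return false;
        bool intel = head[0] == 0x49 && head[1] == 0x49 && head[2] == 42 && head[3] == 0;
        bool motorola = head[0] == 0x4D && head[1] == 0x4D && head[2] == 0 && head[3] == 42;
        return intel || motorola;
    }

    // Accepts "%PDF-1." only at offset 0 or right after a CR or LF.
    private static bool LooksLikePdf(byte[] head)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%PDF-1.");
        for (int i = 0; i + marker.Length <= head.Length; i++)
        {
            bool atLineStart = i == 0 || head[i - 1] == (byte)'\n' || head[i - 1] == (byte)'\r';
            if (!atLineStart) continue;
            bool match = true;
            for (int j = 0; j < marker.Length; j++)
            {
                if (head[i + j] != marker[j]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }

    // Requires a seekable stream, like the IsTiff and IsPdf snippets above.
    public static string GuessContentType(Stream stm)
    {
        stm.Seek(0, SeekOrigin.Begin);
        byte[] head = ReadHead(stm, PdfHeaderWindow);
        if (LooksLikeTiff(head)) return "image/tiff";
        if (LooksLikePdf(head)) return "application/pdf";
        return "application/octet-stream";
    }
}
```

In an ASP.NET page or handler, the returned string could be assigned to Response.ContentType before the file bytes are written, so the server no longer has to force a single MIME type for every extension-less file.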