How can I determine the extent (in bytes) of page 1 in a linearized PDF file?


Problem description


I know that I can 'linearize' a PDF file, for example using the Acrobat SDK or using commercial tools. This is also called 'optimized for web', and it rearranges the PDF so that page 1 can load as quickly as possible. PDFs served in this way are displayed more quickly, because the PDF viewer doesn't have to wait for the whole PDF to be downloaded.

Update: based on the answer below, I now realize that a linearized PDF is not just rearranged, but also contains metadata about its own structure, in the form of the "linearization dictionary".

I have an application where I want to prefetch several PDFs (the results of a query) in anticipation that the user will want to see one of them. It would be awesome if my client could download page 1, and only page 1, for each of the search results. When the user selects one of them, page 1 can be displayed instantly, and the remainder can be downloaded in the background.

I'm looking for a general solution that can be used server-side (Windows or Linux) to preprocess my PDFs, so that I can store and serve page 1 and the remainder separately. Really, all I need to know is where in the PDF is the last byte needed to properly display page 1. If I can have this number, all else follows.

I have browsed the ISO specification for PDF but the file format seems too complex for me to simply parse out where page 1 ends. On the other hand, the tools that linearize PDFs must almost certainly know where page 1 ends.

I am not interested in the complications of serving PDFs in pieces to the clients; this part is already solved since the client is an app, not a browser, and I have full control.

I also don't think it will help me to split the PDF using tools like AP Split into a "page 1" PDF and a complete PDF. If I do, then I will not be able to fool the client viewer into thinking it is a single PDF file, and there will be noticeable flicker when I replace the "page 1" PDF with the full PDF.

Any help or pointers appreciated.

Solution (based on Bobrovsky's answer below):

A properly linearized PDF begins with a header line (defined in section 7.5.2 of the PDF spec) such as "%PDF-1.7" followed by a comment line of at least four binary characters (defined as byte values of 128 or higher). For example:

    %PDF-1.7
    %¤¤¤¤

This header is immediately followed by the linearization dictionary (defined in Appendix F in the PDF spec). An example:

    43 0 obj
    << /Linearized 1.0 % Version
     /L 54567   % File length
     /H [475 598] % Primary hint stream offset and length (part 5)
     /O 45      % Object number of first page’s page object (part 6)
     /E 5437    % Offset of end of first page
     /N 11      % Number of pages in document
     /T 52786 % Offset of first entry in main cross-reference table (part 11)
    >>
    endobj

In this example, the end of the first page is at byte offset 5437. This data structure is simple enough to parse using pretty much any language. The "43 0 obj" thing gives an ID for this dictionary (43) and a generation number (always zero for linearized files). The dictionary itself is surrounded by << and >>, between which are key-value pairs (keys start with a slash, like "/E").

And here's a C# method that finds the relevant number using a regex:

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    public int GetPageOneLength(byte[] data)
    {
      // According to the ISO PDF spec: "The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file" (p. 679)
      // Note that the binary section on line 2 of the header will be entirely converted to question marks ('?')
      string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));
      // \s* (not \w*) allows whitespace after "<<"; Singleline lets "." match across the dictionary's line breaks
      var match = Regex.Match(preamble, @"<<\s*/Linearized.+?/E\s+(?<offset>\d+).*?>>", RegexOptions.Singleline);
      if (!match.Success) throw new InvalidDataException("PDF does not have a proper linearization dictionary");
      return int.Parse(match.Groups["offset"].Value);
    }
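
Once the /E offset is known, storing page 1 and the remainder separately comes down to slicing the file at that offset. The following is only a minimal sketch: the class and method names (`PdfPageOneSplitter`, `SplitAtPageOneEnd`) are made up for illustration, and it assumes /E is the offset of the first byte after the page 1 data, so that bytes 0 through E-1 form the page 1 part.

```csharp
using System;

public static class PdfPageOneSplitter
{
    // Hypothetical helper: splits the file bytes at pageOneEnd (the /E value).
    // Assumption: /E points just past the last byte needed for page 1.
    public static (byte[] PageOne, byte[] Remainder) SplitAtPageOneEnd(byte[] data, int pageOneEnd)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (pageOneEnd < 0 || pageOneEnd > data.Length)
            throw new ArgumentOutOfRangeException(nameof(pageOneEnd));

        byte[] pageOne = new byte[pageOneEnd];
        byte[] remainder = new byte[data.Length - pageOneEnd];

        // Copy bytes [0, pageOneEnd) and [pageOneEnd, data.Length) into separate arrays.
        Array.Copy(data, 0, pageOne, 0, pageOne.Length);
        Array.Copy(data, pageOneEnd, remainder, 0, remainder.Length);
        return (pageOne, remainder);
    }
}
```

Concatenating the two parts in order should reproduce the original file byte for byte, which is what lets the client treat them as a single PDF.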

Note Bobrovsky's caveat that a file may contain the linearization dictionary, yet may not be properly linearized (perhaps because of an incremental edit?). In my case, this is not a problem, as I will linearize all the PDFs myself.

Solution

The linearization dictionary should help you with this.

The dictionary is required to contain the E parameter, which is:

The offset of the end of the first page (the end of part 6 in Example F.1), relative to the beginning of the file.

Please note that not every file with a linearization dictionary is actually linearized (broken generators, changes made after linearization, etc.). So you might not be able to use the described approach if your files have not been verified to be properly linearized.

Please have a look at F.2.2 Linearization Parameter Dictionary (Part 2) in the PDF Reference for more information about the linearization dictionary.
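
One cheap sanity check, which is not a full validation, is to compare the /L entry (the file length shown in the question's example dictionary) against the actual size of the file. As far as I understand the spec, a mismatch means the file was changed after linearization, for example by an appended incremental update. The sketch below rests on several assumptions: the names `LinearizationCheck` and `DeclaredLengthMatches` are hypothetical, /Linearized is taken to be the first key in the dictionary, and a match does not prove the file is correctly linearized.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LinearizationCheck
{
    // Hypothetical helper: true when the /L value found in the first
    // 1024 bytes equals the real length of the file.
    public static bool DeclaredLengthMatches(byte[] data)
    {
        string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));

        // Assumes /Linearized comes first and /L appears somewhere after it.
        Match match = Regex.Match(
            preamble,
            @"<<\s*/Linearized.+?/L\s+(?<length>\d+)",
            RegexOptions.Singleline);

        if (!match.Success) return false;
        return long.Parse(match.Groups["length"].Value) == data.LongLength;
    }
}
```

A file that fails this check is better served as an ordinary PDF, without relying on the /E offset.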
