Extract TOC of PDF?

Question

I am extracting a PDF into images/SWF and text with the help of SWFTools and XPDF. I am running these in a PDF script.

Now I am trying to go one step further and get the TOC (table of contents) from the PDF. Is it possible to extract this information?

Recommended answer

I found this with a little bit of searching. It looks rather promising.

PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html

Note: The tool is Python-based, but you should be able to run it from your script via shell access. Alternatively, since the project is open source, you may be able to glean some useful information from the source code itself.

From the site:

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).

Examples:

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
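
If you want to consume the `-T` output from a script rather than read it by eye, a minimal sketch along the following lines may help. It assumes that `dumppdf.py` is on your PATH and that each outline entry appears in the dump as a line shaped like `<outline level="1" title="Intro">`; that shape is an assumption about the pseudo-XML and may differ between PDFMiner versions, so check it against your own output. The function names `parse_outline_dump` and `dump_toc` are hypothetical helpers, not part of PDFMiner.

```python
import re
import subprocess
from html import unescape

# Assumed shape of one outline entry in the `dumppdf.py -T` output,
# e.g. <outline level="1" title="Intro">. Verify against your own dump.
OUTLINE_RE = re.compile(r'<outline\s+level="(\d+)"\s+title="(.*?)">')


def parse_outline_dump(dump_text):
    """Return a list of (level, title) pairs found in a pseudo-XML outline dump."""
    return [
        (int(level), unescape(title))  # undo XML escaping such as &amp;
        for level, title in OUTLINE_RE.findall(dump_text)
    ]


def dump_toc(pdf_path, tool="dumppdf.py"):
    """Run `dumppdf.py -T` on pdf_path (tool assumed to be on PATH) and parse the result."""
    result = subprocess.run(
        [tool, "-T", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return parse_outline_dump(result.stdout)
```

Calling `dump_toc("foo.pdf")` would then give you a list of `(level, title)` tuples that you can indent by level or store alongside the extracted text. A PDF without an outline has no TOC to dump, so the tool may report an error or the list may come back empty.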
