Extracting the actual in-text title from a PDF
Question
There seem to be a lot of questions about extracting a title from a PDF using its metadata. However, the large majority of titles do not seem to exist in the metadata. I found this out when using pyPdf (http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html).
Is there any way to actually retrieve the in-text title from a PDF? I tried exporting to a text file and then searching, but there is no consistent formatting. Is there any way to export the PDF to a document that keeps its formatting, and then check for a font size >= 14?
Recommended answer
This is a very good question. Applications that create PDFs don't seem to do anything useful with the available metadata fields.
Take pdflatex as an example: even when one sets \title{...} and \author{...} in the preamble, this information is not reflected in the metadata. After a quick search, the solution appears to be to add a block to the preamble that is read only by pdflatex [1]:
\pdfinfo
{
  /Title (...)
  /Author (...)
  ...
}
...which is then placed in the relevant metadata fields of the PDF. It is strange that this is necessary, though.
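One way to check whether a \pdfinfo block actually reached the metadata is to read the Title field back. The following is only a sketch: it assumes the third-party pypdf package (a successor to the pyPdf library mentioned in the question) is installed, and the helper names clean_title and metadata_title are made up for illustration.

```python
from typing import Optional


def clean_title(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace; treat a missing or blank title as None."""
    if raw is None:
        return None
    title = " ".join(raw.split())
    return title or None


def metadata_title(path: str) -> Optional[str]:
    """Return the Title stored in the PDF metadata, or None if unset.

    Assumption: pypdf is installed. It is imported here so that
    clean_title stays usable without it.
    """
    from pypdf import PdfReader

    info = PdfReader(path).metadata  # None when the file has no metadata
    return clean_title(info.title if info is not None else None)
```

If metadata_title returns None for a file, the title has to come from the page content instead.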
I cannot speak for word processors like Word or Writer. Presumably such metadata fields have to be set manually by the user.
If the PDFs are not generated by you, perhaps a heuristic approach is the only way to tackle the problem. [2] seems to do something similar to what you want, but I guess it depends on how well the PDFs were published; this tool seems to be oriented towards scientific papers.
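As a rough illustration of such a heuristic (this is not how the tool in [2] works), the font-size idea from the question could be sketched as follows. It assumes the third-party pdfminer.six package for reading the first page. The names guess_title and first_page_spans are hypothetical, the 14 pt threshold is taken from the question, and the glyph size pdfminer reports is only an approximation of the nominal font size.

```python
from typing import Iterable, List, Optional, Tuple

Span = Tuple[str, float]  # (line text, approximate font size in points)


def guess_title(spans: Iterable[Span], min_size: float = 14.0) -> Optional[str]:
    """Return the first run of lines set in the largest font, or None.

    Returns None when there is no text or when the largest font is
    smaller than min_size.
    """
    cleaned = [(text.strip(), size) for text, size in spans if text.strip()]
    if not cleaned:
        return None
    largest = max(size for _, size in cleaned)
    if largest < min_size:
        return None
    title_lines: List[str] = []
    for text, size in cleaned:
        if abs(size - largest) <= 0.5:  # tolerate small rounding differences
            title_lines.append(text)
        elif title_lines:
            break  # stop at the end of the first large-font run
    return " ".join(title_lines)


def first_page_spans(path: str) -> List[Span]:
    """Collect (text, size) for each text line on the first page.

    Assumption: pdfminer.six is installed. It is imported here so that
    guess_title stays usable without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    spans: List[Span] = []
    for page in extract_pages(path, maxpages=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    spans.append((line.get_text(), max(sizes)))
    return spans
```

A call such as guess_title(first_page_spans("paper.pdf")), where "paper.pdf" is a placeholder path, would then give a candidate title. Documents whose largest text is a journal banner or a logo will fool it, so the result is best treated as a guess.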
I hope that is at least some help.
[1] http://wlug.org.nz/PdfLatexNotes
[2] http://www.molspaces.com/d_cb2bib-metadata.php