Extracting the actual in-text title from a PDF

Question

There seem to be a lot of questions about extracting a title from a PDF using its metadata. However, the large majority of titles do not seem to exist in the metadata. I found this out when using http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html .

Is there any way to actually retrieve the in-text title from a PDF? I tried exporting to a text file and then searching, but there is no consistent formatting. Is there any way to export the PDF to a document that keeps its formatting, then check for a font size >= 14?
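The font-size idea could be sketched roughly as follows. This is a minimal sketch, not a complete solution: it assumes some layout-aware extractor has already produced text runs with their font sizes (libraries such as pdfminer.six or PyMuPDF report sizes, but the extraction step is not shown here). The `Span` type and `guess_title` function are hypothetical names introduced for illustration, and the 14 pt threshold is simply the value from the question.

```python
from typing import NamedTuple, Optional, Sequence


class Span(NamedTuple):
    """One run of text with a single font size (hypothetical input format)."""
    text: str
    font_size: float  # in points
    page: int         # 0-based page index


def guess_title(spans: Sequence[Span], min_size: float = 14.0) -> Optional[str]:
    """Return the largest-font text on the first page, or None.

    Only first-page spans of at least `min_size` points are considered.
    All spans sharing the largest size are joined in document order,
    so a title that wraps over several lines is kept whole.
    """
    candidates = [
        s for s in spans
        if s.page == 0 and s.font_size >= min_size and s.text.strip()
    ]
    if not candidates:
        return None
    largest = max(s.font_size for s in candidates)
    # Tolerate small rounding differences between reported font sizes.
    parts = [s.text.strip() for s in candidates
             if abs(s.font_size - largest) < 0.5]
    return " ".join(parts)


if __name__ == "__main__":
    example = [
        Span("Journal of Examples, vol. 1", 9.0, 0),
        Span("A Study of", 18.0, 0),
        Span("PDF Titles", 18.0, 0),
        Span("Abstract", 12.0, 0),
    ]
    print(guess_title(example))
```

Taking the largest size above the threshold, rather than everything above it, keeps section headings out of the result when they also happen to be 14 pt or larger. It will still be fooled by documents whose biggest text is a journal banner or a logo caption.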

Recommended answer

This is a very good question. Applications that create PDFs don't seem to do anything useful with the available metadata fields.

Take pdflatex as an example: even when one sets the \title{...} and \author{...} in the preamble, this information is not reflected in the metadata. After a quick search, the solution appears to be to introduce a block in the preamble which is read only by pdflatex [1]:

\pdfinfo
{
  /Title (...)
  /Author (...)
  ...
}

...which is then placed in the relevant metadata fields of the PDF. It is strange that this is necessary, though.

I cannot speak for word processors like Word or Writer. One presumes such metadata fields have to be set manually by the user.

If you did not generate the PDFs yourself, a heuristic approach may be the only way to tackle your problem. The tool in [2] seems to do something similar to what you want, but I guess it depends on how well published the PDFs are; the tool seems to be oriented toward scientific papers.
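One way to combine the two sources is to trust the metadata title only when it looks like a real title, and otherwise fall back to a text-based heuristic. The sketch below is an illustration under stated assumptions: the junk patterns are guesses at what auto-generated titles tend to look like, not an authoritative list, and `is_plausible_title`, `pick_title`, and the `fallback` callable are hypothetical names introduced here.

```python
import re
from typing import Callable, Optional

# Patterns that often indicate an auto-generated, useless metadata title.
# These are illustrative assumptions, not an exhaustive list.
_JUNK_PATTERNS = [
    re.compile(r"^\s*$"),                                       # empty
    re.compile(r"^\s*untitled", re.IGNORECASE),                 # "Untitled-1"
    re.compile(r"^\s*microsoft word\s*-", re.IGNORECASE),       # "Microsoft Word - x.doc"
    re.compile(r"\.(docx?|tex|dvi|pdf|odt)\s*$", re.IGNORECASE),  # bare file names
]


def is_plausible_title(title: Optional[str]) -> bool:
    """Return True if the metadata title does not match any junk pattern."""
    if title is None:
        return False
    return not any(p.search(title) for p in _JUNK_PATTERNS)


def pick_title(metadata_title: Optional[str],
               fallback: Callable[[], Optional[str]]) -> Optional[str]:
    """Prefer the metadata title; call `fallback` only when it looks unusable.

    `fallback` is any zero-argument function that guesses a title from the
    document text, for example a font-size heuristic.
    """
    if metadata_title is not None and is_plausible_title(metadata_title):
        return metadata_title.strip()
    return fallback()
```

Passing the fallback as a callable means the more expensive text analysis only runs for PDFs whose metadata is missing or unusable.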

I hope that is at least of some help.

[1] http://wlug.org.nz/PdfLatexNotes
[2] http://www.molspaces.com/d_cb2bib-metadata.php
