Extract text from PDF with respect to formatting (font size, type, etc.)

Views: 672 | Published: 2018/1/6 23:09:02 | Tags: pdf, fonts, styles, extract, font-size

Question

Is it possible to extract text from a PDF file according to its formatting (specific font, font size, font color, etc.)? I prefer Perl, Python, or *nix command-line utilities. My goal is to extract all headlines from a PDF file so that I get a nice index of the articles contained in a single PDF.

Answer

You can get the text together with its font, font size, and position (but not color, as far as I checked) from Ghostscript's txtwrite device (try the -dTextFormat=0 or -dTextFormat=1 option), as well as from mudraw (MuPDF) with the -tt option. Then parse the XML-like output with, for example, Perl.
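Since the question also mentions Python, here is a minimal sketch of the parsing step in Python rather than Perl. It assumes the tool's output consists of span elements carrying font and size attributes, each wrapping char elements with a c attribute. That shape, the SAMPLE text, and the helper names parse_spans and headlines are illustrative assumptions, so check them against what your Ghostscript or MuPDF version actually emits. Treating anything larger than the most common font size as a headline is likewise just one possible heuristic.

```python
import html
import re
from collections import Counter

# Hypothetical sample of the XML-like output. It could be produced with
# something like (file names are placeholders):
#   gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dTextFormat=0 -sOutputFile=out.txt input.pdf
# The exact element and attribute names are an assumption; adjust the
# patterns below to match the real output.
SAMPLE = """\
<page>
<span bbox="72 700 120 700" font="Helvetica-Bold" size="18.0000">
<char bbox="72 700 84 700" c="N"/>
<char bbox="84 700 96 700" c="e"/>
<char bbox="96 700 108 700" c="w"/>
<char bbox="108 700 120 700" c="s"/>
</span>
<span bbox="72 680 126 680" font="Times-Roman" size="10.0000">
<char bbox="72 680 78 680" c="B"/>
<char bbox="78 680 84 680" c="o"/>
<char bbox="84 680 90 680" c="d"/>
<char bbox="90 680 96 680" c="y"/>
<char bbox="96 680 102 680" c=" "/>
<char bbox="102 680 108 680" c="t"/>
<char bbox="108 680 114 680" c="e"/>
<char bbox="114 680 120 680" c="x"/>
<char bbox="120 680 126 680" c="t"/>
</span>
<span bbox="72 640 108 640" font="Helvetica-Bold" size="18.0000">
<char bbox="72 640 84 640" c="Q"/>
<char bbox="84 640 96 640" c="&amp;"/>
<char bbox="96 640 108 640" c="A"/>
</span>
</page>
"""

# Regexes instead of an XML parser: the output is only XML-like and is
# not guaranteed to be a single well-formed document.
SPAN_RE = re.compile(r'<span\b([^>]*)>(.*?)</span>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
CHAR_RE = re.compile(r'<char\b[^>]*\bc="([^"]*)"')


def parse_spans(text):
    """Return a list of dicts with the font, size and text of each span."""
    spans = []
    for attrs, body in SPAN_RE.findall(text):
        attr = dict(ATTR_RE.findall(attrs))
        # Characters are XML-escaped (e.g. &amp;), so unescape them.
        chars = "".join(html.unescape(c) for c in CHAR_RE.findall(body))
        spans.append({
            "font": attr.get("font", ""),
            "size": float(attr.get("size", 0)),
            "text": chars,
        })
    return spans


def headlines(spans):
    """Return the text of spans set larger than the body font size.

    The body size is taken to be the size that covers the most characters.
    """
    weight = Counter()
    for span in spans:
        weight[span["size"]] += len(span["text"])
    if not weight:
        return []
    body_size = weight.most_common(1)[0][0]
    return [s["text"] for s in spans
            if s["size"] > body_size and s["text"].strip()]


if __name__ == "__main__":
    for title in headlines(parse_spans(SAMPLE)):
        print(title)
```

To run this on real output, read the file written by the tool and pass its contents to parse_spans in place of SAMPLE. If headlines in your PDF are marked by a bold font rather than a larger size, filtering on the "font" field instead may work better.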
