Programmatically change font color of text in PDF


Problem description

I'm not familiar with the PDF specification at all. I was wondering if it's possible to directly manipulate a PDF file so that certain blocks of text that I've identified as important are highlighted in colors of my choice. Language of choice would be python.

Solution

It's possible, but not necessarily easy, because the PDF format is so rich. Adobe's PDF Reference describes the format in detail. The first elementary example it gives of how PDFs display text is:

BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET

BT and ET are commands to begin and end a text object; Tf selects the font resource F13 (which happens to be Helvetica) at size 12; Td positions the cursor at the given coordinates; Tj writes the glyphs for the string that precedes it. The flavor is close to reverse Polish notation (operands first, then the operator), and indeed quite close to the flavor of PostScript, one of Adobe's other great contributions to typesetting.
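Since color is set by operators in the same stream, the simplest recoloring idea is to add a color operator to a text object. PDF's `r g b rg` operator sets the nonstroking (fill) color, which is what ordinary text is painted with. Here is a minimal sketch, assuming you already have an uncompressed content stream as a string with one operator per line, as in the example above. The function name `colorize_text_objects` is hypothetical, and real streams are usually compressed and laid out less tidily, so treat this as an illustration rather than a working tool:

```python
def colorize_text_objects(content: str, rgb: tuple[float, float, float]) -> str:
    """Insert an 'r g b rg' operator right after every BT line.

    Assumes (for illustration only) an uncompressed content stream
    with one operator per line.
    """
    r, g, b = rgb
    color_op = f"{r:g} {g:g} {b:g} rg"  # nonstroking (fill) color in RGB
    out = []
    for line in content.splitlines():
        out.append(line)
        if line.strip() == "BT":
            # The color is set at the start of the text object, before any Tj.
            out.append(color_op)
    return "\n".join(out)


stream = "BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET"
print(colorize_text_objects(stream, (1, 0, 0)))  # red text
```

Note that the color is part of the graphics state, so it stays in effect after ET until some later operator changes it; a real tool would have to set the color back for text that should stay unchanged.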

The problem is that nothing in the PDF spec says that text which looks like it belongs together on the displayed page must actually be together in the file. Since precise coordinates can always be given, a PDF generated by a sophisticated typographic layout system might position text character by character, by coordinates. Reconstructing text as words and sentences is therefore not necessarily easy: it's almost as hard as optical text recognition, except that you are given the characters precisely (well, almost: some alleged "images" might actually display as characters ;-).
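To give a feel for the reconstruction problem, here is a rough sketch that groups individually positioned characters back into lines. It assumes you have already obtained `(x, y, char)` tuples from some parser; the function name `group_into_lines` and the `y_tolerance` parameter are made up for this illustration, and real clustering also has to deal with word gaps, columns, rotation and font sizes:

```python
def group_into_lines(glyphs, y_tolerance=2.0):
    """Group (x, y, char) tuples into lines of text, top to bottom.

    PDF's default y axis points up, so a larger y is higher on the page.
    Characters whose y is within y_tolerance of the line's first
    character are treated as belonging to that line.
    """
    lines = []  # each entry: (y of first char seen, [(x, char), ...])
    for x, y, ch in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, ch))
        else:
            lines.append((y, [(x, ch)]))
    # Within each line, order characters left to right by x.
    return ["".join(ch for _, ch in sorted(chars)) for _, chars in lines]


glyphs = [(300, 720, "B"), (288, 720, "A"), (312, 720.5, "C"), (288, 700, "D")]
print(group_into_lines(glyphs))
```

This sketch does not insert spaces between words; deciding where a word ends from the gaps between x coordinates is part of what makes the problem hard.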

pyPdf is a very simple pure-Python library that's a good starting point for playing around with PDF files. Its "text extraction" function is quite elementary and does nothing but concatenate the arguments of a few text-drawing commands; you'll see that suffices on some docs, and is quite unusable on others, but at least it's a start. As distributed, pyPdf does just about nothing with colors, but with some hacking that could be remedied.
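The "concatenate the arguments" approach can be imitated in a few lines. This is not pyPdf's code, just a sketch of the idea under the assumption that you have an uncompressed content stream as a string; the name `naive_extract` is hypothetical. It only looks at literal strings followed by Tj, and ignores TJ arrays, hex strings, nested parentheses and font encodings:

```python
import re

# A literal string operand followed by Tj, e.g. "(ABC) Tj".
# Allows backslash-escaped characters inside the string, but not
# unescaped nested parentheses.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def naive_extract(content: str) -> str:
    """Concatenate the string operands of every Tj operator, in stream order."""
    return "".join(m.group(1) for m in TJ_PATTERN.finditer(content))


stream = "BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\n(DEF) Tj\nET"
print(naive_extract(stream))
```

Because the operands are joined in stream order rather than page order, and escape sequences are left undecoded, this works on simple documents and falls apart on ones that place text piece by piece.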

reportlab's powerful Python library is entirely focused on generating new PDFs, not on interpreting or modifying existing ones. At the other extreme, the pure-Python library pdfminer focuses entirely on parsing PDF files; it does some clustering to try to reconstruct text in cases where simpler libraries would be stumped.

I don't know of an existing library that performs the transformational tasks you desire, but it should be feasible to mix and match some of these existing ones to get most of it done... good luck!
