How to erase text from PDF using Python

Question

I'm writing a Python script to edit text in PDFs.

I have this Python code, which lets me add text at specific positions in a PDF file.

import io

import PyPDF2
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

# Note: this uses the legacy PyPDF2 API (PdfFileReader, mergePage, ...),
# which was removed in PyPDF2 3.0.

# Create a new single-page PDF in memory with ReportLab
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
# Draw the text at a specific position (x=300, y=115, in points)
can.drawString(300, 115, "Hello world")
can.save()
# Move back to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)

# Keep the source file open until the output is written: pages are read lazily
with open("original.pdf", "rb") as source:
    existing_pdf = PyPDF2.PdfFileReader(source)
    num_pages = existing_pdf.getNumPages()
    # Overlay the new PDF (like a watermark) on the last page of the original
    page = existing_pdf.getPage(num_pages - 1)
    page.mergePage(new_pdf.getPage(0))
    # Add all pages from the original PDF to the output PDF
    output = PyPDF2.PdfFileWriter()
    for n in range(num_pages):
        output.addPage(existing_pdf.getPage(n))
    # Finally, write the output to a real file
    with open("output.pdf", "wb") as output_stream:
        output.write(output_stream)

My problem: I want to replace the text at a specific position in my original PDF with my custom text. A way of writing blank characters over it would do the trick, but I couldn't find anything that does this.

PS: It must be Python code because I will need to deploy this as a .exe file later, and I only know how to do that with Python code.

Recommended answer

A general-purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can never be done: I have demonstrated it with the Adobe PDF Library, albeit with a very simple input file with no complications. But I'm not sure that PyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.

You (or, more realistically, your PDF library) have to parse the page contents and keep track of changes to the graphics state: changes to the current transformation matrix (in case the text is in a Form XObject), changes to the text matrix, and changes to the font. You have to use the font resource to get character widths in order to figure out where the text cursor ends up after a string is drawn. You may also need to handle the standard 14 fonts, which don't carry that information in their font resources (the application, that is, your program, is expected to know their metrics).
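To make the bookkeeping described above more concrete, here is a minimal sketch in plain Python. It assumes you already have a decoded (uncompressed) content stream as a string, and it handles only a handful of operators (q, Q, cm, BT, Tf, Tm, Td, Tj). The names `TOKEN`, `multiply`, `locate_text` and `SAMPLE` are made up for this illustration and are not part of PyPDF2 or any other library. It also skips the hard part mentioned above: it does not advance the text cursor by the width of each string, because that needs the font metrics.

```python
import re

# Tokens of a (simplified) content stream: strings, names, numbers, operators.
TOKEN = re.compile(r"""
    (?P<string>\((?:\\.|[^\\()])*\))       # literal string, no nested parentheses
  | (?P<name>/[^\s/\[\]()<>]+)             # name such as /F1
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))   # integer or real
  | (?P<op>[A-Za-z*'"]+)                   # operator such as Tm or Tj
""", re.VERBOSE)

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def multiply(m1, m2):
    """Return m1 x m2 for PDF matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def locate_text(stream):
    """Return (text, font, origin) for each Tj, tracking CTM and text matrix."""
    ctm, font = IDENTITY, None
    saved = []                      # graphics state stack for q / Q
    tm = tlm = IDENTITY             # text matrix and text line matrix
    operands, found = [], []
    for match in TOKEN.finditer(stream):
        kind, token = match.lastgroup, match.group()
        if kind == "number":
            operands.append(float(token))
        elif kind in ("string", "name"):
            operands.append(token)
        else:
            if token == "q":
                saved.append((ctm, font))
            elif token == "Q":
                ctm, font = saved.pop()
            elif token == "cm":
                ctm = multiply(tuple(operands[-6:]), ctm)
            elif token == "BT":
                tm = tlm = IDENTITY
            elif token == "Tf":
                font = (operands[-2], operands[-1])
            elif token == "Tm":
                tm = tlm = tuple(operands[-6:])
            elif token == "Td":
                tlm = multiply((1, 0, 0, 1, operands[-2], operands[-1]), tlm)
                tm = tlm
            elif token == "Tj":
                # Origin of the string in user space; the cursor advance
                # after the string is NOT modelled (needs glyph widths).
                origin = multiply(tm, ctm)[4:]
                found.append((operands[-1][1:-1], font, origin))
            operands = []           # every operator consumes its operands
    return found


# Hypothetical content stream used only for this demonstration.
SAMPLE = """
q
1 0 0 1 10 20 cm
BT
/F1 12 Tf
1 0 0 1 50 700 Tm
(Hello) Tj
0 -14 Td
(World) Tj
ET
Q
"""

for text, font, origin in locate_text(SAMPLE):
    print(text, font, origin)
```

In this sample the `cm` offset of (10, 20) is added to the text matrix, so "Hello" is reported at (60, 720) and "World" one line lower at (60, 706). A real page would also need TJ arrays, hex strings, nested parentheses, Form XObjects and font encodings, which is why a full PDF library is the realistic option.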

After all that, removing the text is easy as long as you don't need to break up a Tj or TJ (show text) instruction into separate parts. If you want to keep the text that follows from shifting, you may need to insert a new Tm instruction that repositions it to where it would have been.
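As a rough illustration of the easy case, the sketch below removes a whole `(…) Tj` instruction from a decoded content stream and, optionally, puts a `Tm` in its place so the following text starts where it would have started. The function name `drop_shown_text` and its `resume_at` parameter are invented for this example. The position passed as `resume_at` has to be computed separately from the font's character widths, and the inserted matrix assumes an unscaled, unrotated text matrix.

```python
import re


def drop_shown_text(stream, target, resume_at=None):
    """Remove every `(target) Tj` from a decoded content stream.

    If resume_at=(x, y) is given, a Tm instruction is written in place of the
    removed text so that whatever follows is repositioned explicitly.
    Assumes `target` contains no parentheses or backslashes, and that the
    text is shown with a single Tj rather than a TJ array.
    """
    pattern = re.compile(r"\(" + re.escape(target) + r"\)\s*Tj")
    replacement = ""
    if resume_at is not None:
        x, y = resume_at
        replacement = f"1 0 0 1 {x:g} {y:g} Tm"
    # A function is used as the replacement so it is inserted literally.
    return pattern.sub(lambda _match: replacement, stream)


# Hypothetical stream: three strings shown back to back on one line.
stream = "BT /F1 12 Tf 1 0 0 1 50 700 Tm (Hello ) Tj (Secret ) Tj (World) Tj ET"

# Without a Tm, "World" would slide left into the gap left by "Secret ".
print(drop_shown_text(stream, "Secret "))

# With a Tm, "World" stays at the (assumed) position it had before.
print(drop_shown_text(stream, "Secret ", resume_at=(120.5, 700)))
```

The value 120.5 is a placeholder, not a measured width. If the target is only part of a longer string, the Tj itself has to be split, which is the harder case described above.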

Inserting new text can be challenging. If you want to stay consistent with the font being used, and that font is embedded as a subset, it may not contain the glyphs you need for the inserted text. And after insertion, you have to decide whether to reflow the text that comes after it.
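One way to get a rough idea of whether a subset font can render the replacement text is to look at its ToUnicode CMap, if it has one. The sketch below assumes you have already extracted that CMap as text and that it uses only `bfchar` entries (`bfrange` sections are ignored). The function names and the sample CMap are made up for illustration. A ToUnicode CMap is optional and describes character codes rather than the glyph program itself, so treat the result as a hint, not a guarantee.

```python
import re


def chars_in_tounicode(cmap_text):
    """Collect the Unicode characters listed in the bfchar sections of a CMap."""
    chars = set()
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for _code, dest in pairs:
            # Destination values are UTF-16BE encoded, written as hex.
            chars.update(bytes.fromhex(dest).decode("utf-16-be"))
    return chars


def missing_glyphs(text, available):
    """Return the characters of `text` that the subset does not seem to cover."""
    return sorted(set(text) - set(available))


# Hypothetical ToUnicode fragment for a subset containing only H, e, l, o.
SAMPLE_CMAP = """
4 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
endbfchar
"""

available = chars_in_tounicode(SAMPLE_CMAP)
print(missing_glyphs("Hello", available))
print(missing_glyphs("Help", available))
```

Here "Hello" is fully covered while "Help" is missing the letter "p". In that situation you would have to switch to a different font for the inserted text or embed the missing glyphs.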

Lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's Redaction features would likely be more cost-effective than trying to program this from scratch.
