Remove all text from a PDF file
Question
I am using Ghostscript to convert a source PDF file into a series of PNG images. Before converting each PDF page to a PNG image, I need to remove all text from the PDF, so that the resulting page image contains every other element but no text.
Can I achieve this with Ghostscript, or do I need to look into other tools?
I would also be interested in a tool that can read my source PDF and save it again with all the text removed.
Recommended answer
Since my previous answer, development has continued and a new option is now available, which justifies a new answer.
Recent versions of Ghostscript support three new parameters that let you remove all TEXT, all IMAGE, or all VECTOR elements from a PDF.
To remove all TEXT elements from an input PDF, run
gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
To remove all raster IMAGE elements from an input PDF, run
gs -o no-more-images.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
To remove all VECTOR elements from an input PDF, run
gs -o no-more-vectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
Of course, you can also combine any two of the above parameters (combining all three will create empty pages).
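Since the original goal was PNG page images without text, the same switch can in principle be passed when rendering straight to a raster device, which skips the intermediate PDF. The sketch below assumes that your Ghostscript build honours -dFILTERTEXT with the png16m device, so it is worth checking on a sample file first. The function name, the GS override variable, the "page" prefix and the 150 dpi default are illustrative choices and not part of the original answer.

```shell
# Hypothetical helper: render every page of a PDF to PNG with text filtered out.
# usage: render_pages_without_text input.pdf [prefix] [dpi]
# Set GS=echo for a dry run that prints the arguments instead of calling gs.
render_pages_without_text() {
    input=$1
    prefix=${2:-page}   # output files become prefix-001.png, prefix-002.png, ...
    dpi=${3:-150}       # rendering resolution; an arbitrary default
    # %03d in the -o pattern is replaced by Ghostscript with the page number.
    ${GS:-gs} -o "${prefix}-%03d.png" -sDEVICE=png16m -r"$dpi" \
        -dFILTERTEXT "$input"
}
```

Calling render_pages_without_text input.pdf would then write one PNG per page. If the raster output still shows text, a fallback is to write a text-free PDF with pdfwrite first and render that file to PNG in a second step.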
Here are screenshots of a PDF page: the original contains all three element types, and each resulting page shows what remains after filtering.
Screenshot of the original PDF page containing "image", "vector" and "text" elements.
Running the following six commands creates all six possible variations of the remaining content:
gs -o noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
gs -o noTXT.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o noVCT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf
gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf
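The six commands differ only in the output name and the filter flags, so they can be driven from a small table. This is a minimal sketch using the same file names and flags as above; the make_variants function name and the GS override variable are assumptions added for illustration.

```shell
# Hypothetical helper: produce all six filtered variants of one input PDF.
# usage: make_variants input.pdf
# Set GS=echo for a dry run that prints the arguments instead of calling gs.
make_variants() {
    input=$1
    gs_bin=${GS:-gs}
    # Each table row: output base name, then the filter flags to apply.
    while read -r name flags; do
        # $flags is deliberately unquoted so it splits into separate options.
        # stdin is redirected so the command cannot consume the table below.
        $gs_bin -o "$name.pdf" -sDEVICE=pdfwrite $flags "$input" </dev/null
    done <<EOF
noIMG -dFILTERIMAGE
noTXT -dFILTERTEXT
noVCT -dFILTERVECTOR
onlyIMG -dFILTERVECTOR -dFILTERTEXT
onlyTXT -dFILTERVECTOR -dFILTERIMAGE
onlyVCT -dFILTERIMAGE -dFILTERTEXT
EOF
}
```

Running GS=echo make_variants input.pdf first is a cheap way to check the generated arguments before letting Ghostscript write any files.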
The following image illustrates the results:
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.