Removing all text from a PDF

Problem description

I have a collection of PDFs that consist of scanned images which have then been OCR'd. The text is still displayed "graphically" (in other words, the scanned image of the text is still present) and the OCR'd text sits "behind the image". This allows the documents to be searched, the text copied, etc.

Due to a nasty (and now resolved) bug in OS X, some of the OCR'd text is corrupted. I'd therefore like to remove the text from the PDF and re-OCR the document. For many non-trivial reasons, I don't want to go down the "re-print the document to a PDF" route: I'd prefer to repair the document in place as much as possible.

As I can't find a PDF utility that will do what I'm asking, and I have a bit of coding experience, I've decided to roll up my sleeves and try to knock together a bit of .NET (C#) code to remove the text.

I've looked at iTextSharp, and I can open a sample document, but where I'm getting stuck is finding (and therefore removing) just the text in a document. I've looked at various PDF spec documents and I'm quickly getting lost, and all the examples I've seen for iTextSharp deal with adding objects, graphics or text to a document.

To summarise, all I want to do is find all the blocks of text and remove them, whilst leaving the graphic (originally JPG) images alone. Can anyone tell me what object types I should be looking for, and what hierarchy I should be iterating through, to achieve this?

Recommended answer

Adapting the approach from "How to find and replace text in a existing PDF file with PDFTK (or other command line application)", I was able to delete the rendered text using pdftk and sed. This is surely not fully general, but it was a quick hack for my needs.

I ended up with:

pdftk my_input.pdf output - uncompress | sed -e 's/\[.*\]TJ/()Tj/' -e 's/(.*)Tj/()Tj/' | pdftk - output my_output.pdf compress

This converts the streams to text format, where I find uses of (blah)Tj and [blah]TJ and replace each one with an empty string, then converts back to compressed binary. The edited intermediate file is no longer a strictly valid PDF, even though the original input was, so pdftk does some magic on the way back in to fix up the output so that it is valid again. This will not work with extended characters without some new patterns.
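The substitution step can be tried on its own, without a PDF, to see what it does to a content stream. The following is a minimal sketch rather than the answerer's exact command: the function names `strip_text_ops` and `strip_pdf_text` and the sample stream are made up for illustration. The bracket-limited patterns (`[^]]*`, `[^)]*`), the `g` flag, the tolerance for whitespace before the operator, and the extra pattern for hex strings such as `<48656C6C6F> Tj` are assumptions added here. They still do not cover strings that contain `]` or escaped parentheses.

```shell
#!/bin/sh
# Blank out text-showing operators in an uncompressed PDF content stream.
# Reads stdin, writes stdout. LC_ALL=C makes sed treat the input as bytes,
# since uncompressed PDFs can still contain binary data (e.g. image streams).
strip_text_ops() {
  LC_ALL=C sed \
    -e 's/\[[^]]*\][[:space:]]*TJ/()Tj/g' \
    -e 's/([^)]*)[[:space:]]*Tj/()Tj/g' \
    -e 's/<[0-9A-Fa-f]*>[[:space:]]*Tj/()Tj/g'
}

# Full pipeline, using the pdftk invocations from the answer above.
# $1 = input PDF, $2 = output PDF. Defined here but not called.
strip_pdf_text() {
  pdftk "$1" output - uncompress | strip_text_ops | pdftk - output "$2" compress
}

# Hypothetical content stream: three text-showing operators plus one image.
sample='BT
/F1 12 Tf
(Hello world) Tj
[(Hel) -20 (lo)] TJ
<48656C6C6F> Tj
ET
q 612 0 0 792 0 0 cm /Im1 Do Q'

printf '%s\n' "$sample" | strip_text_ops
```

In the printed result the three text-showing lines each become `()Tj`, while the `Do` line that paints the scanned image is left untouched. On a real file, something like `strip_pdf_text my_input.pdf my_output.pdf` would run the whole round trip. It is worth checking the output in a viewer, since content streams written by other tools may lay out their operators differently.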
