How do I duplicate a PDF with some text replacement and redaction?


Question






Hi,

For the past day I have been exploring a couple of third-party components for working with PDFs from C#: Aspose.pdf.net and iTextSharp. Here is what I need them for:

I have some PDFs that contain sensitive information as text, such as a person's name, a city, etc.
These PDFs need to be duplicated, and while the copy is being created the sensitive text must be found and replaced with dummy text.
The replacement is essential so that the original information cannot be traced by any fraudulent means.
The replaced text also needs to be redacted.

The text search is expected to support regular expressions, because there may be variations of the text that need to be masked.

Could you please help me with how this can be done using iTextSharp?

Thanks in advance.

What I have tried:

I explored various options in iTextSharp and succeeded in duplicating the PDF file, but I have not yet been able to search and replace text.

Recommended answer

There is some discussion of this in "replace string in PDF document (ITextSharp or PdfSharp)" on Stack Overflow; the code shown there may or may not work.

My approach would be different, and it depends on how many document formats you have. Note: under no circumstances should you redact text merely by drawing or stamping a black box over it, because the PDF document itself still holds the data and a binary inspection could reveal the details.
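To make the warning about black boxes concrete, here is a small sketch in Python (chosen only to keep it library-free; the answer itself does not use Python). The byte string is a hand-written, hypothetical page content stream, not output from iTextSharp, and `add_black_box` and `still_contains` are illustrative names. It shows that appending drawing operators leaves the text-showing operator untouched.

```python
# Hypothetical, hand-written page content stream (uncompressed for readability).
# Real PDFs usually compress streams, but decompressing yields the same kind of data.
text_ops = b"BT /F1 12 Tf 72 720 Td (Jane Smith) Tj ET\n"  # shows the text
box_ops = b"0 0 0 rg 70 715 80 16 re f\n"                  # fills a black rectangle over it


def add_black_box(content_stream: bytes) -> bytes:
    """Simulate 'redacting' by appending drawing operators on top of the page."""
    return content_stream + box_ops


def still_contains(content_stream: bytes, secret: bytes) -> bool:
    """Return True if the secret is still present in the stream's bytes."""
    return secret in content_stream


stamped = add_black_box(text_ops)
print(still_contains(stamped, b"Jane Smith"))  # True: the text operator is untouched
```

A viewer would render the box on top of the name, but anything that reads the stream (text extraction, copy and paste, a hex editor) still sees it.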

I would parse all the text from the document (see "Read Text from a PDF in C# with iTextSharp" by Chris Schiffhauer) and build the redacted document from scratch. That is easy for me to say, of course; it depends on how complicated your documents are.
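Once the text has been extracted, the regex-based masking the question asks for can be done on plain strings before the new document is built. The following is a minimal, library-neutral sketch in Python, assuming the extracted text is already available as a string; the extraction itself and the rebuilding of the PDF are not shown. The names `SENSITIVE_PATTERNS` and `mask_text` and the patterns themselves are hypothetical and would need to be adapted to the real documents.

```python
import re

# Hypothetical patterns for sensitive values; real ones depend on your documents.
SENSITIVE_PATTERNS = {
    "NAME": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?"),
    "CITY": re.compile(r"\b(?:London|Paris|Mumbai)\b"),
}


def mask_text(text, patterns=SENSITIVE_PATTERNS):
    """Replace each regex match with a dummy token such as NAME_1.

    The same original value always maps to the same token within one call,
    so the masked text stays internally consistent.
    """
    replacements = {}  # (label, original value) -> dummy token

    def substitute(label):
        def _sub(match):
            key = (label, match.group(0))
            if key not in replacements:
                count = sum(1 for k in replacements if k[0] == label) + 1
                replacements[key] = f"{label}_{count}"
            return replacements[key]
        return _sub

    for label, pattern in patterns.items():
        text = pattern.sub(substitute(label), text)
    return text


print(mask_text("Dr. Jane Smith lives in Paris. Dr. Jane Smith works in London."))
# NAME_1 lives in CITY_1. NAME_1 works in CITY_2.
```

The mapping from original values to tokens lives only in memory for the duration of the call, so nothing in the masked output points back to the source text. If even the fact that two mentions refer to the same person is sensitive, replace every match with one fixed dummy string instead.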


Here is another approach you can follow if you do not need searchable text in the resulting PDF file:
1- Parse the text from the original PDF file and record the rectangles where the text you wish to redact is located.
2- Convert the PDF pages to raster images.
3- Draw redaction rectangles on the raster images using the rectangle information obtained in step 1.
4- Save the resulting images as new PDF pages, which do not contain any of the original text.

This way the resulting pages are guaranteed to contain none of the original text.
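One detail that connects steps 1 and 3 is the coordinate conversion: rectangles recorded from the PDF are normally in points (1/72 inch) with the origin at the bottom-left of the page, while raster images use pixels with the origin at the top-left. The sketch below, in Python and independent of any PDF library, shows that conversion under stated assumptions: the page is unrotated, its lower-left corner is at (0, 0), and the page was rendered at a known DPI. The function name `pdf_rect_to_pixels` is hypothetical.

```python
import math


def pdf_rect_to_pixels(rect, page_height_pt, dpi=150):
    """Convert (x0, y0, x1, y1) in PDF points to (left, top, right, bottom) in pixels.

    Assumes an unrotated page whose lower-left corner is at (0, 0).
    Edges are rounded outward so the box fully covers the text.
    """
    scale = dpi / 72.0  # pixels per point
    x0, y0, x1, y1 = rect
    left = math.floor(min(x0, x1) * scale)
    right = math.ceil(max(x0, x1) * scale)
    # Flip the y axis: PDF y grows upward, image y grows downward.
    top = math.floor((page_height_pt - max(y0, y1)) * scale)
    bottom = math.ceil((page_height_pt - min(y0, y1)) * scale)
    return left, top, right, bottom


# A rectangle near the top of a US Letter page (792 pt tall), rendered at 144 DPI.
print(pdf_rect_to_pixels((72, 700, 216, 720), page_height_pt=792, dpi=144))
# (144, 144, 432, 184)
```

Rounding outward is a deliberate choice here: a redaction box that is one pixel too large is harmless, while one that is a pixel too small can leave part of a glyph visible.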

