En masse inline editing in an uncompressed PDF

Problem description

I have a large PDF (~20 MB, about 160 MB uncompressed). I need to do a find and replace on the text in it, about 1000 times. Here is what I tried.

  1. Via SVG

    • Transform to SVG (Inkscape)
    • Read the SVG line by line and do the replacements in the file
    • Transform back to PDF

=> bad output: the text is poorly rendered, probably because of geometric transform matrices in the SVG

  2. Creating ~1000 sed commands

    • Uncompress the PDF
    • Perform each replacement with its own sed command
    • Recompress the PDF

=> way too slow: each sed command takes about 20 seconds, so the whole process takes several hours

  3. Read line by line and replace

    • Uncompress the PDF
    • Read the PDF line by line
      • find the text to be replaced
      • replace it using Perl
      • write the line to a new file
    • Compress the new file

=> the new file is apparently damaged, because the uncompressed PDF still contains binary data streams that get written out as lines of text

I wonder whether it is possible to read the uncompressed PDF line by line but edit it directly in place. How could I do this?

I have searched for Perl in-place editing, but it applies the changes to the whole file at once, while I'd like to edit a single line.

Other ideas are more than welcome ;)

Following the advice, I used CAM::PDF; this was the most efficient and simplest solution.

Solution

There is no difference between approaches 2 and 3. sed reads the input file line by line and writes each line, changed or not, to the output. Passing the -i switch changes nothing fundamental: sed still writes a complete new file, which then takes the place of the original under the same name (GNU sed writes to a temporary file and renames it over the input). That's it, no magic involved. So if your Perl script damaged the content and sed did not, your script must be doing something that sed does not. The main difference is that a Perl script can be made much faster when replacing many strings. See the related question "Using sed on text files with a csv".

The main trick is to compile a single regexp that matches all the search strings, so each line is scanned once and the running time stays roughly linear in the size of the input instead of growing with the number of replacements.

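use strict;
use warnings;

# Map each search string to its replacement.
my %replace = ( foo => 'bar' );
# One alternation of all keys, longest first so a key is not shadowed by its own prefix.
my $re = join '|', map quotemeta, sort { length($b) <=> length($a) } keys %replace;
$re = qr/($re)/;

while (<>) {
    s/$re/$replace{$1}/g;
    print;    # write every line back out, changed or not
}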
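The third attempt in the question produced a damaged file. One possible cause (an assumption, since the asker's script is not shown) is that the binary streams passed through an encoding or newline-translating I/O layer. The sketch below wraps the same single-regexp technique in subroutines and opens both files with the `:raw` layer so that bytes are copied through untouched. The names `build_regex`, `replace_all` and `rewrite_file`, as well as the sample table, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical replacement table; the real one would hold the ~1000 pairs.
my %replace = (
    'foo'    => 'bar',
    'foobar' => 'baz',
);

# Build one alternation, longest key first, so 'foobar' is tried before 'foo'.
sub build_regex {
    my ($table) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %{$table};
    return qr/($alt)/;
}

# Apply every replacement to one chunk of data in a single pass.
sub replace_all {
    my ($data, $re, $table) = @_;
    $data =~ s/$re/$table->{$1}/g;
    return $data;
}

# Copy $in to $out through replace_all, using raw (byte-level) I/O layers.
sub rewrite_file {
    my ($in, $out, $re, $table) = @_;
    open my $ifh, '<:raw', $in  or die "Cannot read $in: $!";
    open my $ofh, '>:raw', $out or die "Cannot write $out: $!";
    while (my $line = <$ifh>) {
        print {$ofh} replace_all($line, $re, $table);
    }
    close $ofh or die "Cannot close $out: $!";
    close $ifh;
    return;
}

my $re = build_regex(\%replace);
print replace_all("foo and foobar\n", $re, \%replace);    # prints "bar and baz"
```

Calling `rewrite_file('in.pdf', 'out.pdf', $re, \%replace)` (placeholder file names) would process a whole file. Byte-safe I/O only prevents accidental damage to the streams: a replacement whose length differs from the original still changes byte offsets inside the PDF, which is one reason a PDF-aware library is the safer route.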

You can use this with your original approach, but I would recommend doing it in a Perl script, which lets you reuse the compiled regexp and the replacement hash across PDF files. You can also try combining it with CAM::PDF, whose distribution includes the example script changepagestring.pl. PDF::API2 is another option; it would require more work but may give better results. But remember that the PDF format is not intended for modification.
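Combining the regexp with CAM::PDF might look roughly like the sketch below, loosely modelled on what changepagestring.pl does for a single string. It assumes CAM::PDF is installed from CPAN and uses its `new`, `numPages`, `getPageContent`, `setPageContent` and `cleanoutput` methods. The sub names and the replacement table are hypothetical, and the script has not been run against the asker's file.

```perl
use strict;
use warnings;
use CAM::PDF;    # CPAN module, not part of core Perl

# Hypothetical replacement table.
my %replace = ( foo => 'bar' );

# One alternation of all keys, longest first.
sub build_regex {
    my ($table) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %{$table};
    return qr/($alt)/;
}

# Run the replacements over the content stream of every page.
sub replace_in_pdf {
    my ($infile, $outfile, $re, $table) = @_;
    my $pdf = CAM::PDF->new($infile) or die "$CAM::PDF::errstr\n";
    for my $page (1 .. $pdf->numPages()) {
        my $content = $pdf->getPageContent($page);
        # Only write the page back if something actually changed.
        if ($content =~ s/$re/$table->{$1}/g) {
            $pdf->setPageContent($page, $content);
        }
    }
    $pdf->cleanoutput($outfile);
    return;
}

# Usage: perl script.pl in.pdf out.pdf
replace_in_pdf(@ARGV[0, 1], build_regex(\%replace), \%replace) if @ARGV == 2;
```

Because the library parses the file and writes it out again, the PDF does not have to be uncompressed and recompressed by hand. Text that the PDF stores in several fragments will still not match a plain string, so the result should be checked on a few pages first.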
