检测和更改PDF中的字符串 [英] Detect and alter strings in PDFs

查看：67 发布时间：2020/5/25 3:54:26 python regex perl pdf pypdf

本文介绍了检测和更改PDF中的字符串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我希望能够检测PDF中的模式并以某种方式标记它.

I want to be able to detect a pattern in a PDF and somehow flag it.

例如，在此PDF 中，有字符串*2.我希望能够解析PDF，检测*[integer]的所有实例，并做一些事情来吸引人们的注意(例如，将它们突出显示为黄色或在空白处添加一个符号).

For instance, in this PDF, there's the string *2. I want to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like highlight them yellow or add a symbol in the margin).

我更喜欢用Python进行此操作，但是我可以使用其他语言.到目前为止，我已经能够使用 pyPdf 来阅读PDF的文本.我可以使用正则表达式来检测模式.但是我还无法弄清楚如何标记匹配项并重新保存PDF.

I would prefer to do this in Python, but I'm open to other languages. So far, I've been able to use pyPdf to read the PDF's text. I can use a regex to detect the pattern. But I haven't been able to figure out how to flag the match and re-save the PDF.

推荐答案

要么人们不感兴趣，要么Python没有能力，所以这里是Perl中的解决方案:-).认真地说，如上所述，您不需要更改字符串". PDF批注是您的解决方案.我不久前有一个带有注释的小项目，那里有一些代码.但是，我的内容解析器不是通用的，并且您不需要全面的解析-意味着能够更改内容并将其写回.因此，我求助于外部工具.我使用的PDF库有些底层，但是我不介意.这也意味着，人们期望对PDF内部结构有适当的了解，以了解发生了什么情况.否则，只需使用该工具即可.

Either people are not interested, or Python's not capable, so here's solution in Perl :-). Seriously, as noted above, you don't need to "alter strings". PDF annotations are the solution for you. I had small project with annotations not long ago, some code's from there. But, my content parser was not universal, and you don't need full-blown parsing -- meaning being able to alter content and write it back. Therefore I resorted to external tool. PDF Library I use is somewhat low-level, but I don't mind. It also means, one's expected to have proper knowledge of PDF internals to understand what's going on. Otherwise, just use the tool.

这是标记的镜头，例如用命令将OP文件中的所有gerunds

Here's a shot of marking e.g. all gerunds in OP's file with a command

perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'

代码(里面的注释也值得一读):

The code (comment inside worth reading, too):

use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;

#####################################################################
#
#  This is PDF highlight mark-up tool.
#  Though fully functional, it's still a prototype proof-of-concept.
#  Please don't feed it with non-pdf files or patterns like '\d*' 
#  (because you probably want '\d+', don't you?).
#  
#  Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
#  ToDo:
#  - error handling is primitive if any.
#  - cropped files (CropBox) are processed incorrectly. Fix it.
#  - of course there can be other useful parameters.
#  - allow loading them from file.
#  - allow searching across lines (e.g. for multi-word patterns)
#    and certainly across "spans" within a line (see mudraw output).
#  - multi-color mark-up, not just yellow.
#  - control over output file name.
#  - compress output (use cleanoutput method instead of output,
#    plus more robust (think compressed object streams) compressors 
#    may be useful).
#  - file list processing.
#  - annotations are not just colorful marks on the page, their 
#    dictionaries can contain all sorts of useful information, which may 
#    be extracted automatically further up the food chain i.e. by 
#    whoever consumes these files (date, time, author, comments, actual 
#    text below, etc., etc., plus think of customized appearence streams,
#    placing them on layers, etc..
#  - ???
#
#   Most complexity in the code comes from adding appearance 
#   dictionary (AP). You can safely delete it, because most viewers don't 
#   need AP for standard annotations. Ironically, muPDF-viewer wants it 
#   (otherwise highlight placement is not 100% correct), and since I relied 
#   on muPDF-tools, I thought it be proper to create PDFs consumable by 
#   their viewer... Firefox wants AP too, btw.
#
#####################################################################

my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);
GetOptions('-f=s' => \$file,   '-p=s' => \$csv, 
           '-c!'  => \$c_flag, '-w!'  => \$w_flag) 
    and defined($file)
    and defined($csv)
or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
       "\t-f\t\tFILE\t PDF file to annotate\n",
       "\t-p\t\tLIST\t comma-separated patterns\n",
       "\t-c or -noc\t\t be case sensitive (default = no)\n",
       "\t-w or -now\t\t whole words only (default = yes)\n";
my $re = Regexp::Assemble->new
    ->add(split(',', $csv))
    ->anchor_word($w_flag)
    ->flags($c_flag ? '' : 'i')
    ->re;
my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf = CAM::PDF->new($file);

sub __num_nodes_list {
    my $precision = shift;
    [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}

sub add_highlight {
    my ($idx, $x1, $y1, $x2, $y2) = @_;
    my $p = $pdf->getPage($idx);

    # mirror vertically to get to normal cartesian plane 
    my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
    ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
    # corner radius
    my $r = 2;

    # AP appearance stream
    my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
    $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
    $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
    $s .= "@{[sprintf '%.1f',$y2 - $y1]} re B\n";

    my $highlight = CAM::PDF::Node->new('dictionary', {
        Subtype => CAM::PDF::Node->new('label', 'Highlight'),
        Rect => CAM::PDF::Node->new('array', 
          __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
        QuadPoints => CAM::PDF::Node->new('array', 
            __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
        BS => CAM::PDF::Node->new('dictionary', {
            S => CAM::PDF::Node->new('label', 'S'),
            W => CAM::PDF::Node->new('number', 0),
        }),
        Border => CAM::PDF::Node->new('array', 
            __num_nodes_list(0, 0, 0, 0)),
        C => CAM::PDF::Node->new('array', 
            __num_nodes_list(0, 1, 1, 0)),

        AP => CAM::PDF::Node->new('dictionary', {
            N => CAM::PDF::Node->new('reference', 
                $pdf->appendObject(undef, 
                    CAM::PDF::Node->new('object',
                        CAM::PDF::Node->new('dictionary', {
                            Subtype => CAM::PDF::Node->new('label', 'Form'),
                            BBox => CAM::PDF::Node->new('array',
                              __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2, 
                                                 $y2 - $y1 + $r * 2)),
                            Resources => CAM::PDF::Node->new('dictionary', {
                                ExtGState => CAM::PDF::Node->new('dictionary', {
                                    GS0 => CAM::PDF::Node->new('dictionary', {
                                        BM => CAM::PDF::Node->new('label', 
                                            'Multiply'),
                                    }),
                                }),
                            }),
                            StreamData => CAM::PDF::Node->new('stream', $s),
                            Length => CAM::PDF::Node->new('number', length $s),
                        }),
                    ),
                ,0),
            ),
        }),
    });

    $p->{Annots} ||= CAM::PDF::Node->new('array', []);
    push @{$pdf->getValue($p->{Annots})}, $highlight;

    $pdf->{changes}->{$p->{Type}->{objnum}} = 1
}

my $page_index = 1;
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                while ($string =~ /$re/g) {
                    my ($x1, $y1) = 
                        split ' ', $span->{char}->[$-[0]]->{bbox};
                    my (undef, undef, $x2, $y2) = 
                        split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                    add_highlight($page_index, $x1, $y1, $x2, $y2)
                }
            }
        }
    }
    $page_index ++
}
$pdf->output($file =~ s/(.{4}$)/++$1/r);

__END__

P.s.我用"Perl"标记了问题，以征询社区的一些反馈(代码更正等).

P.s. I tagged the Question with 'Perl', to maybe have some feedback (code corrections, etc.) from community.

这篇关于检测和更改PDF中的字符串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

检测和更改PDF中的字符串 [英] Detect and alter strings in PDFs

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

检测和更改PDF中的字符串 [英] Detect and alter strings in PDFs

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭