How to find all occurrences of specific text in a PDF and insert a page break above?

Question

I have a tricky requirement with a PDF

I need to search my pdf for a specific string - Property Number:

Each time this is found, I need to add a page break ABOVE

I have access to both iText and Spire.PDF; I am looking at iText first

From other posts here, I have established that I need to use a PdfStamper

The logic below adds a new page which does work

However, in my case, I just need a page break not a blank page

var newFile = @"c:\temp\full.pdf";
var dest = @"c:\temp\dest.pdf";
var reader = new PdfReader(newFile);
if (File.Exists(dest))
{
  File.Delete(dest);
}

var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.CreateNew));
var total = reader.NumberOfPages + 1;
for (var pageNumber = total; pageNumber > 0; pageNumber--)
{
  var pageContent = reader.GetPageContent(pageNumber);
  stamper.InsertPage(pageNumber, PageSize.A4);
}

stamper.Close();
reader.Close();

The picture below shows an example. This would actually become 3 pages: the existing page, with a new page break inserted above the first occurrence of Property Number:

Another page break is needed above the second occurrence

Recommended answer

This answer shares a proof-of-concept for finding all occurrences of specific text in a PDF and inserting a page break above using iText and Java. It should not be too difficult to port it to iTextSharp and C#.

Furthermore, for production use some extra code has to be added, because the code currently makes some assumptions; for example, it assumes non-rotated pages. It also does not handle annotations at all.

The task is actually a combination of two tasks, finding the text and inserting the page breaks, so we need

  1. an extraction strategy for custom text locations and

  2. a tool to cut pages.



SearchTextLocationExtractionStrategy

To extract the locations of custom text, we extend the iText LocationTextExtractionStrategy so that it can also extract the positions of a custom text string, or more precisely of the matches of a regular expression:

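// Imports assume iText 5.5.x. Rectangle2D is assumed to be iText's own port
// (com.itextpdf.awt.geom), matching the return type of LineSegment.getBoundingRectange().
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.awt.geom.Rectangle2D;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class SearchTextLocationExtractionStrategy extends LocationTextExtractionStrategy {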
    public SearchTextLocationExtractionStrategy(Pattern pattern) {
        super(new TextChunkLocationStrategy() {
            public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) {
                // while baseLine has been changed to not neutralize
                // effects of rise, ascentLine and descentLine explicitly
                // have not: We want the actual positions.
                return new AscentDescentTextChunkLocation(baseline, renderInfo.getAscentLine(),
                        renderInfo.getDescentLine(), renderInfo.getSingleSpaceWidth());
            }
        });
        this.pattern = pattern;
    }

    static Field locationalResultField = null;
    static Method filterTextChunksMethod = null;
    static Method startsWithSpaceMethod = null;
    static Method endsWithSpaceMethod = null;
    static Field textChunkTextField = null;
    static Method textChunkSameLineMethod = null;
    static {
        try {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            filterTextChunksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("filterTextChunks",
                    List.class, TextChunkFilter.class);
            filterTextChunksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace",
                    String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);
            textChunkTextField = TextChunk.class.getDeclaredField("text");
            textChunkTextField.setAccessible(true);
            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        } catch (NoSuchFieldException | SecurityException | NoSuchMethodException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    public Collection<TextRectangle> getLocations(TextChunkFilter chunkFilter) {
        Collection<TextRectangle> result = new ArrayList<>();
        try {
            List<TextChunk> filteredTextChunks = (List<TextChunk>) filterTextChunksMethod.invoke(this,
                    locationalResultField.get(this), chunkFilter);
            Collections.sort(filteredTextChunks);

            StringBuilder sb = new StringBuilder();
            List<AscentDescentTextChunkLocation> locations = new ArrayList<>();
            TextChunk lastChunk = null;
            for (TextChunk chunk : filteredTextChunks) {
                String chunkText = (String) textChunkTextField.get(chunk);
                if (lastChunk == null) {
                    // first chunk: there is no previous chunk to compare with
                } else if ((boolean) textChunkSameLineMethod.invoke(chunk, lastChunk)) {
                    // we only insert a blank space if the trailing character of the previous string
                    // wasn't a space,
                    // and the leading character of the current string isn't a space
                    if (isChunkAtWordBoundary(chunk, lastChunk)
                            && !((boolean) startsWithSpaceMethod.invoke(this, chunkText))
                            && !((boolean) endsWithSpaceMethod.invoke(this,
                                    (String) textChunkTextField.get(lastChunk)))) {
                        sb.append(' ');
                        LineSegment spaceBaseLine = new LineSegment(lastChunk.getEndLocation(),
                                chunk.getStartLocation());
                        locations.add(new AscentDescentTextChunkLocation(spaceBaseLine, spaceBaseLine, spaceBaseLine,
                                chunk.getCharSpaceWidth()));
                    }
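                } else {
                    // a new line starts: search the completed line for matches
                    collectMatches(sb, locations, result);

                    sb.setLength(0);
                    locations.clear();
                }
                sb.append(chunkText);
                locations.add((AscentDescentTextChunkLocation) chunk.getLocation());
                lastChunk = chunk;
            }
            // the last line is not followed by a line change, so search it here
            collectMatches(sb, locations, result);
        } catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return result;
    }

    /** Adds a TextRectangle to result for each match of the pattern in the given line. */
    void collectMatches(CharSequence line, List<AscentDescentTextChunkLocation> locations,
            Collection<TextRectangle> result) {
        assert line.length() == locations.size();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            int i = matcher.start();
            Vector baseStart = locations.get(i).getStartLocation();
            TextRectangle textRectangle = new TextRectangle(matcher.group(), baseStart.get(Vector.I1),
                    baseStart.get(Vector.I2));
            // grow the rectangle over the ascent and descent lines of every matched character
            for (; i < matcher.end(); i++) {
                AscentDescentTextChunkLocation location = locations.get(i);
                textRectangle.add(location.getAscentLine().getBoundingRectange());
                textRectangle.add(location.getDescentLine().getBoundingRectange());
            }

            result.add(textRectangle);
        }
    }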

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
            super.renderText(info);
    }

    public static class AscentDescentTextChunkLocation extends TextChunkLocationDefaultImp {
        public AscentDescentTextChunkLocation(LineSegment baseLine, LineSegment ascentLine, LineSegment descentLine,
                float charSpaceWidth) {
            super(baseLine.getStartPoint(), baseLine.getEndPoint(), charSpaceWidth);
            this.ascentLine = ascentLine;
            this.descentLine = descentLine;
        }

        public LineSegment getAscentLine() {
            return ascentLine;
        }

        public LineSegment getDescentLine() {
            return descentLine;
        }

        final LineSegment ascentLine;
        final LineSegment descentLine;
    }

    public class TextRectangle extends Rectangle2D.Float {
        public TextRectangle(final String text, final float xStart, final float yStart) {
            super(xStart, yStart, 0, 0);
            this.text = text;
        }

        public String getText() {
            return text;
        }

        final String text;
    }

    final Pattern pattern;
}

SearchTextLocationExtractionStrategy.java
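
The core of getLocations is an index mapping: because renderText feeds the strategy one character at a time, position i in the line string is meant to correspond to entry i in the list of locations, so the start and end of a regex match can be used directly as list indices. Here is a minimal standalone sketch of that idea using only the JDK. The class MatchBoxSketch, the method matchBoxes and the glyph boxes are made up for illustration and are not part of iText.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchBoxSketch {
    /**
     * For each regex match in the line, unions the boxes of the matched characters.
     * Expects exactly one box per character of the line.
     */
    public static List<Rectangle2D.Float> matchBoxes(String line, List<Rectangle2D.Float> charBoxes,
            Pattern pattern) {
        if (line.length() != charBoxes.size())
            throw new IllegalArgumentException("one box per character expected");
        List<Rectangle2D.Float> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            if (matcher.start() == matcher.end())
                continue; // an empty match covers no character
            // copy the first box so the input list is not modified
            Rectangle2D.Float box = (Rectangle2D.Float) charBoxes.get(matcher.start()).clone();
            for (int i = matcher.start() + 1; i < matcher.end(); i++)
                box.add(charBoxes.get(i));
            result.add(box);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Property Number: 42";
        List<Rectangle2D.Float> boxes = new ArrayList<>();
        for (int i = 0; i < line.length(); i++)
            boxes.add(new Rectangle2D.Float(10f * i, 700f, 10f, 12f)); // hypothetical glyph boxes
        System.out.println(matchBoxes(line, boxes, Pattern.compile("Property Number")));
    }
}
```

The top edge of such a box is what a page splitter would use as the position of the break.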

As some necessary members of the base class are private or package private, we have to use reflection to extract them.
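
For readers who have not used this technique, here is a small JDK-only sketch of the reflection pattern. The class Base is a hypothetical stand-in for a library class with private members; it is not an iText class, and the helper names are made up. Unlike the proof-of-concept above, this sketch fails fast instead of printing the stack trace, which is one possible choice for production code. Keep in mind that code relying on private members can break whenever the library changes its internals.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

public class ReflectionSketch {
    /** Hypothetical stand-in for a library class with private members. */
    public static class Base {
        private final String secret = "chunks";

        private boolean startsWithSpace(String s) {
            return !s.isEmpty() && s.charAt(0) == ' ';
        }
    }

    /** Reads a private field declared in declaringClass from the target object. */
    public static Object readPrivateField(Object target, Class<?> declaringClass, String name) {
        try {
            Field field = declaringClass.getDeclaredField(name);
            field.setAccessible(true);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read field " + name, e);
        }
    }

    /** Calls a private method declared in declaringClass on the target object. */
    public static Object callPrivateMethod(Object target, Class<?> declaringClass, String name,
            Class<?>[] parameterTypes, Object... args) {
        try {
            Method method = declaringClass.getDeclaredMethod(name, parameterTypes);
            method.setAccessible(true);
            return method.invoke(target, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot call method " + name, e);
        }
    }

    public static void main(String[] args) {
        Base base = new Base();
        System.out.println(readPrivateField(base, Base.class, "secret"));
        System.out.println(callPrivateMethod(base, Base.class, "startsWithSpace",
                new Class<?>[] { String.class }, " leading space"));
    }
}
```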

AbstractPdfPageSplittingTool

The page splitting functionality of this tool has been extracted from the PdfVeryDenseMergeTool from this answer (https://stackoverflow.com/a/29078954/1729265). Furthermore, the tool is abstract to allow custom positions for page breaks.

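// Imports assume iText 5.5.x.
import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

public abstract class AbstractPdfPageSplittingTool {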
    public AbstractPdfPageSplittingTool(Rectangle size, float top) {
        this.pageSize = size;
        this.topMargin = top;
    }

    public void split(OutputStream outputStream, PdfReader... inputs) throws DocumentException, IOException {
        try {
            openDocument(outputStream);
            for (PdfReader reader : inputs) {
                split(reader);
            }
        } finally {
            closeDocument();
        }
    }

    void openDocument(OutputStream outputStream) throws DocumentException {
        final Document document = new Document(pageSize, 36, 36, topMargin, 36);
        final PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        this.document = document;
        this.writer = writer;
        newPage();
    }

    void closeDocument() {
        try {
            document.close();
        } finally {
            this.document = null;
            this.writer = null;
            this.yPosition = 0;
        }
    }

    void newPage() {
        document.newPage();
        yPosition = pageSize.getTop(topMargin);
    }

    void split(PdfReader reader) throws IOException {
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            split(reader, page);
        }
    }

    void split(PdfReader reader, int page) throws IOException
    {
        PdfImportedPage importedPage = writer.getImportedPage(reader, page);
        PdfContentByte directContent = writer.getDirectContent();
        yPosition = pageSize.getTop();

        Rectangle pageSizeToImport = reader.getPageSize(page);
        float[] borderPositions = determineSplitPositions(reader, page);
        if (borderPositions == null || borderPositions.length < 2)
            return;

        for (int borderIndex = 0; borderIndex + 1 < borderPositions.length; borderIndex++) {
            float height = borderPositions[borderIndex] - borderPositions[borderIndex + 1];
            if (height <= 0)
                continue;

            directContent.saveState();
            directContent.rectangle(0, yPosition - height, pageSizeToImport.getWidth(), height);
            directContent.clip();
            directContent.newPath();

            writer.getDirectContent().addTemplate(importedPage, 0, yPosition - (borderPositions[borderIndex] - pageSizeToImport.getBottom()));

            directContent.restoreState();
            newPage();
        }
    }

    protected abstract float[] determineSplitPositions(PdfReader reader, int page);

    Document document = null;
    PdfWriter writer = null;
    float yPosition = 0;

    final Rectangle pageSize;
    final float topMargin;
}

AbstractPdfPageSplittingTool.java
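
The arithmetic inside split(reader, page) may be easier to follow in isolation. The sketch below uses only the JDK and reproduces just the numbers: each pair of neighbouring border positions defines a horizontal strip of the source page, and the vertical offset moves the top of that strip to the top of the target page. SliceSketch, Slice and offsetFor are illustrative names, not part of iText or of the tool above.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceSketch {
    /** One horizontal strip of the source page, in source page coordinates. */
    public static final class Slice {
        public final float top;
        public final float bottom;

        public Slice(float top, float bottom) {
            this.top = top;
            this.bottom = bottom;
        }

        public float height() {
            return top - bottom;
        }

        /** Vertical shift that places the strip's top edge at targetTop on the new page. */
        public float offsetFor(float targetTop, float sourceBottom) {
            return targetTop - (top - sourceBottom);
        }
    }

    /** Turns border positions sorted from top to bottom into non-empty strips. */
    public static List<Slice> slices(float[] bordersDescending) {
        List<Slice> result = new ArrayList<>();
        for (int i = 0; i + 1 < bordersDescending.length; i++) {
            float height = bordersDescending[i] - bordersDescending[i + 1];
            if (height <= 0)
                continue; // skip empty strips, e.g. two matches at the same height
            result.add(new Slice(bordersDescending[i], bordersDescending[i + 1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical A4 page with matches at y = 600 and y = 300
        for (Slice slice : slices(new float[] { 842f, 600f, 300f, 0f }))
            System.out.println("height " + slice.height() + ", offset " + slice.offsetFor(842f, 0f));
    }
}
```

In the tool itself the strip is then realised by clipping to a rectangle of that height and adding the imported page as a template shifted by that offset.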

To implement the OP's task:

I need to search my pdf for a specific string - Property Number:

Each time this is found, I need to add a page break ABOVE

one can use the classes above like this:

AbstractPdfPageSplittingTool tool = new AbstractPdfPageSplittingTool(PageSize.A4, 36) {
    @Override
    protected float[] determineSplitPositions(PdfReader reader, int page) {
        Collection<TextRectangle> locations = Collections.emptyList();
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            SearchTextLocationExtractionStrategy strategy = new SearchTextLocationExtractionStrategy(
                    Pattern.compile("Property Number"));
            parser.processContent(page, strategy, Collections.emptyMap()).getResultantText();
            locations = strategy.getLocations(null);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        List<Float> borders = new ArrayList<>();
        for (TextRectangle rectangle : locations)
        {
            borders.add((float)rectangle.getMaxY());
        }

        Rectangle pageSize = reader.getPageSize(page);
        borders.add(pageSize.getTop());
        borders.add(pageSize.getBottom());
        Collections.sort(borders, Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i=0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }
};

tool.split(new FileOutputStream(RESULT), new PdfReader(SOURCE));

(SplitPages.java test method testSplitDocumentAboveAngestellter)
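
The border list built in determineSplitPositions can also be assembled without any iText types. The following JDK-only sketch is a variation, not the code above: it additionally drops duplicate positions and positions outside the page, which the tool otherwise handles by skipping strips with non-positive height. BorderSketch and borders are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.TreeSet;

public class BorderSketch {
    /**
     * Returns the page top, the match positions strictly inside the page and the
     * page bottom, sorted from top to bottom and without duplicates.
     */
    public static float[] borders(float pageTop, float pageBottom, Collection<Float> matchTops) {
        TreeSet<Float> sorted = new TreeSet<Float>(Comparator.<Float>reverseOrder());
        sorted.add(pageTop);
        sorted.add(pageBottom);
        for (float y : matchTops) {
            if (y > pageBottom && y < pageTop)
                sorted.add(y);
        }
        float[] result = new float[sorted.size()];
        int i = 0;
        for (float y : sorted)
            result[i++] = y;
        return result;
    }

    public static void main(String[] args) {
        // hypothetical match positions on an A4 page; 900 lies outside the page
        System.out.println(Arrays.toString(borders(842f, 0f, Arrays.asList(600f, 300f, 600f, 900f))));
    }
}
```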
