How to find all occurrences of specific text in a PDF and insert a page break above?

Views: 617 Posted: 2018/11/16 16:52:54 Tags: c# pdf itext


Problem description

I have a tricky requirement involving a PDF.

I need to search my PDF for a specific string: "Property Number:".

Each time it is found, I need to add a page break above it.

I have access to both iText and Spire.PDF; I am looking at iText first.

From other posts here I have established that I need to use a PdfStamper.

The logic below adds a new page, and it does work.

However, in my case I just need a page break, not a blank page.

var newFile = @"c:\temp\full.pdf";
var dest = @"c:\temp\dest.pdf";
var reader = new PdfReader(newFile);
if (File.Exists(dest))
{
  File.Delete(dest);
}

var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.CreateNew));
var total = reader.NumberOfPages + 1;
for (var pageNumber = total; pageNumber > 0; pageNumber--)
{
  var pageContent = reader.GetPageContent(pageNumber);
  stamper.InsertPage(pageNumber, PageSize.A4);
}

stamper.Close();
reader.Close();

The picture below shows an example. The result would actually be 3 pages: the existing page is split by a page break inserted above the first occurrence of "Property Number:", and another page break is needed above the second occurrence.

Recommended answer

SearchTextLocationExtractionStrategy

To extract the locations of custom text, we extend the iText LocationTextExtractionStrategy so that it can also report the positions of a custom text string, or more precisely of the matches of a regular expression:

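import java.awt.geom.Rectangle2D;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class SearchTextLocationExtractionStrategy extends LocationTextExtractionStrategy {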
    public SearchTextLocationExtractionStrategy(Pattern pattern) {
        super(new TextChunkLocationStrategy() {
            public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) {
                // while baseLine has been changed to not neutralize
                // effects of rise, ascentLine and descentLine explicitly
                // have not: We want the actual positions.
                return new AscentDescentTextChunkLocation(baseline, renderInfo.getAscentLine(),
                        renderInfo.getDescentLine(), renderInfo.getSingleSpaceWidth());
            }
        });
        this.pattern = pattern;
    }

    static Field locationalResultField = null;
    static Method filterTextChunksMethod = null;
    static Method startsWithSpaceMethod = null;
    static Method endsWithSpaceMethod = null;
    static Field textChunkTextField = null;
    static Method textChunkSameLineMethod = null;
    static {
        try {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            filterTextChunksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("filterTextChunks",
                    List.class, TextChunkFilter.class);
            filterTextChunksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace",
                    String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);
            textChunkTextField = TextChunk.class.getDeclaredField("text");
            textChunkTextField.setAccessible(true);
            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        } catch (NoSuchFieldException | SecurityException | NoSuchMethodException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    public Collection<TextRectangle> getLocations(TextChunkFilter chunkFilter) {
        Collection<TextRectangle> result = new ArrayList<>();
        try {
            @SuppressWarnings("unchecked")
            List<TextChunk> filteredTextChunks = (List<TextChunk>) filterTextChunksMethod.invoke(this,
                    locationalResultField.get(this), chunkFilter);
            Collections.sort(filteredTextChunks);

            // sb collects the text of the current line; locations holds one entry per character in sb
            StringBuilder sb = new StringBuilder();
            List<AscentDescentTextChunkLocation> locations = new ArrayList<>();
            TextChunk lastChunk = null;
            for (TextChunk chunk : filteredTextChunks) {
                String chunkText = (String) textChunkTextField.get(chunk);
                if (lastChunk == null) {
                    // first chunk: nothing to compare with yet
                } else if ((boolean) textChunkSameLineMethod.invoke(chunk, lastChunk)) {
                    // we only insert a blank space if the trailing character of the previous string
                    // wasn't a space, and the leading character of the current string isn't a space
                    String lastChunkText = (String) textChunkTextField.get(lastChunk);
                    if (isChunkAtWordBoundary(chunk, lastChunk)
                            && !((boolean) startsWithSpaceMethod.invoke(this, chunkText))
                            && !((boolean) endsWithSpaceMethod.invoke(this, lastChunkText))) {
                        sb.append(' ');
                        LineSegment spaceBaseLine = new LineSegment(lastChunk.getEndLocation(),
                                chunk.getStartLocation());
                        locations.add(new AscentDescentTextChunkLocation(spaceBaseLine, spaceBaseLine, spaceBaseLine,
                                chunk.getCharSpaceWidth()));
                    }
                } else {
                    // a new line starts: search the finished line, then start collecting anew
                    addMatches(sb, locations, result);
                    sb.setLength(0);
                    locations.clear();
                }
                sb.append(chunkText);
                locations.add((AscentDescentTextChunkLocation) chunk.getLocation());
                lastChunk = chunk;
            }
            // search the last line too; no line change follows it to trigger the search
            addMatches(sb, locations, result);
        } catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return result;
    }

    // Adds one TextRectangle to result for each match of the pattern in the collected line.
    void addMatches(CharSequence lineText, List<AscentDescentTextChunkLocation> locations,
            Collection<TextRectangle> result) {
        assert lineText.length() == locations.size();
        Matcher matcher = pattern.matcher(lineText);
        while (matcher.find()) {
            int i = matcher.start();
            Vector baseStart = locations.get(i).getStartLocation();
            TextRectangle textRectangle = new TextRectangle(matcher.group(), baseStart.get(Vector.I1),
                    baseStart.get(Vector.I2));
            // grow the rectangle over the ascent and descent lines of every matched character
            for (; i < matcher.end(); i++) {
                AscentDescentTextChunkLocation location = locations.get(i);
                textRectangle.add(location.getAscentLine().getBoundingRectange());
                textRectangle.add(location.getDescentLine().getBoundingRectange());
            }
            result.add(textRectangle);
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
            super.renderText(info);
    }

    public static class AscentDescentTextChunkLocation extends TextChunkLocationDefaultImp {
        public AscentDescentTextChunkLocation(LineSegment baseLine, LineSegment ascentLine, LineSegment descentLine,
                float charSpaceWidth) {
            super(baseLine.getStartPoint(), baseLine.getEndPoint(), charSpaceWidth);
            this.ascentLine = ascentLine;
            this.descentLine = descentLine;
        }

        public LineSegment getAscentLine() {
            return ascentLine;
        }

        public LineSegment getDescentLine() {
            return descentLine;
        }

        final LineSegment ascentLine;
        final LineSegment descentLine;
    }

    public class TextRectangle extends Rectangle2D.Float {
        public TextRectangle(final String text, final float xStart, final float yStart) {
            super(xStart, yStart, 0, 0);
            this.text = text;
        }

        public String getText() {
            return text;
        }

        final String text;
    }

    final Pattern pattern;
}

(SearchTextLocationExtractionStrategy.java)
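The core idea of getLocations can be illustrated without iText. The following is only a minimal sketch, assuming one bounding box is known for each character of a line (which is what the renderText override arranges by feeding the strategy single-character chunks). The names MatchBoxSketch and matchBoxes are hypothetical and are not part of iText or of the class above.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of iText.
final class MatchBoxSketch {
    /**
     * Returns one bounding box per regex match in line.
     * charBoxes.get(i) must be the box of line.charAt(i).
     */
    static List<Rectangle2D.Float> matchBoxes(Pattern pattern, String line,
            List<Rectangle2D.Float> charBoxes) {
        if (line.length() != charBoxes.size())
            throw new IllegalArgumentException("need exactly one box per character");
        List<Rectangle2D.Float> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            if (matcher.start() == matcher.end())
                continue; // an empty match covers no character, so it has no box
            // start with the box of the first matched character ...
            Rectangle2D.Float box = new Rectangle2D.Float();
            box.setRect(charBoxes.get(matcher.start()));
            // ... and grow it to cover every further matched character
            for (int i = matcher.start() + 1; i < matcher.end(); i++)
                box.add(charBoxes.get(i));
            result.add(box);
        }
        return result;
    }
}
```

Because the match indices of the regular expression are indices into the line text, they can be used directly as indices into the list of character boxes; this is the same correspondence the strategy relies on between its StringBuilder and its locations list.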

As some necessary members of the base class are private or package private, we have to use reflection to access them.
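For readers unfamiliar with the technique, the reflection pattern used in the static initializer can be sketched on a stand-in class. Everything named here (LibraryBase, secret, shout, ReflectionSketch) is hypothetical; only the java.lang.reflect calls are real JDK API.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Hypothetical stand-in for a library class whose members are not accessible.
class LibraryBase {
    private final String secret = "locationalResult";

    private String shout(String s) {
        return s.toUpperCase();
    }
}

final class ReflectionSketch {
    // Reads the private field "secret" of a LibraryBase instance.
    static String readSecret(LibraryBase target) {
        try {
            Field field = LibraryBase.class.getDeclaredField("secret");
            field.setAccessible(true); // lift the access check for this Field object
            return (String) field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("field not available", e);
        }
    }

    // Invokes the private method "shout(String)" on a LibraryBase instance.
    static String callShout(LibraryBase target, String argument) {
        try {
            Method method = LibraryBase.class.getDeclaredMethod("shout", String.class);
            method.setAccessible(true);
            return (String) method.invoke(target, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("method not available", e);
        }
    }
}
```

A caveat on this design: member names looked up by string are not checked by the compiler, so a library upgrade that renames or removes them only fails at run time. The sketch fails fast with an exception in that case rather than continuing with null handles.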

The page splitting functionality of the following tool, AbstractPdfPageSplittingTool, has been extracted from the PdfVeryDenseMergeTool from this answer (https://stackoverflow.com/a/29078954/1729265). Furthermore, the class is abstract so that subclasses can supply custom positions for the page breaks.

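import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

public abstract class AbstractPdfPageSplittingTool {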
    public AbstractPdfPageSplittingTool(Rectangle size, float top) {
        this.pageSize = size;
        this.topMargin = top;
    }

    public void split(OutputStream outputStream, PdfReader... inputs) throws DocumentException, IOException {
        try {
            openDocument(outputStream);
            for (PdfReader reader : inputs) {
                split(reader);
            }
        } finally {
            closeDocument();
        }
    }

    void openDocument(OutputStream outputStream) throws DocumentException {
        final Document document = new Document(pageSize, 36, 36, topMargin, 36);
        final PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        this.document = document;
        this.writer = writer;
        newPage();
    }

    void closeDocument() {
        try {
            document.close();
        } finally {
            this.document = null;
            this.writer = null;
            this.yPosition = 0;
        }
    }

    void newPage() {
        document.newPage();
        yPosition = pageSize.getTop(topMargin);
    }

    void split(PdfReader reader) throws IOException {
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            split(reader, page);
        }
    }

    void split(PdfReader reader, int page) throws IOException
    {
        PdfImportedPage importedPage = writer.getImportedPage(reader, page);
        PdfContentByte directContent = writer.getDirectContent();
        yPosition = pageSize.getTop();

        Rectangle pageSizeToImport = reader.getPageSize(page);
        float[] borderPositions = determineSplitPositions(reader, page);
        if (borderPositions == null || borderPositions.length < 2)
            return;

        for (int borderIndex = 0; borderIndex + 1 < borderPositions.length; borderIndex++) {
            float height = borderPositions[borderIndex] - borderPositions[borderIndex + 1];
            if (height <= 0)
                continue;

            directContent.saveState();
            directContent.rectangle(0, yPosition - height, pageSizeToImport.getWidth(), height);
            directContent.clip();
            directContent.newPath();

            writer.getDirectContent().addTemplate(importedPage, 0, yPosition - (borderPositions[borderIndex] - pageSizeToImport.getBottom()));

            directContent.restoreState();
            newPage();
        }
    }

    protected abstract float[] determineSplitPositions(PdfReader reader, int page);

    Document document = null;
    PdfWriter writer = null;
    float yPosition = 0;

    final Rectangle pageSize;
    final float topMargin;
}

(AbstractPdfPageSplittingTool.java)
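The arithmetic inside split(reader, page) can be followed in isolation. The sketch below reproduces only the numbers, not the PDF output, and assumes that the y offset passed to addTemplate shifts the imported page vertically by that amount (an unrotated source page is assumed). SliceGeometrySketch, Slice and the parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the slice arithmetic; produces no PDF.
final class SliceGeometrySketch {
    /** Clip rectangle bottom and height on the output page, plus the y offset for the imported page. */
    static final class Slice {
        final float clipBottom, height, templateOffsetY;

        Slice(float clipBottom, float height, float templateOffsetY) {
            this.clipBottom = clipBottom;
            this.height = height;
            this.templateOffsetY = templateOffsetY;
        }
    }

    /**
     * borders: split positions on the source page, sorted from top to bottom.
     * firstTop: y where the first slice of a source page starts on the output page.
     * laterTop: y where each following slice starts (output page top minus top margin).
     */
    static List<Slice> slices(float[] borders, float sourceBottom, float firstTop, float laterTop) {
        List<Slice> result = new ArrayList<>();
        float yPosition = firstTop;
        for (int i = 0; i + 1 < borders.length; i++) {
            float height = borders[i] - borders[i + 1];
            if (height <= 0)
                continue; // empty slice, for example two matches on the same line
            // the slice occupies [yPosition - height, yPosition] on the output page;
            // the imported page is shifted so that borders[i] lands at yPosition
            result.add(new Slice(yPosition - height, height, yPosition - (borders[i] - sourceBottom)));
            yPosition = laterTop; // each slice is followed by a new page
        }
        return result;
    }
}
```

For an A4 source page (height 842) with split positions at 600 and 300 and a top margin of 36, the second slice covers source y from 300 to 600 and is shifted by 206, so it lands between 506 and 806 on its own output page.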

To implement the OP's task,

"I need to search my pdf for a specific string - Property Number: [...] Each time this is found, I need to add a page break ABOVE"

one can use the classes above like this:

AbstractPdfPageSplittingTool tool = new AbstractPdfPageSplittingTool(PageSize.A4, 36) {
    @Override
    protected float[] determineSplitPositions(PdfReader reader, int page) {
        Collection<TextRectangle> locations = Collections.emptyList();
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            SearchTextLocationExtractionStrategy strategy = new SearchTextLocationExtractionStrategy(
                    Pattern.compile("Property Number"));
            parser.processContent(page, strategy, Collections.emptyMap()).getResultantText();
            locations = strategy.getLocations(null);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        List<Float> borders = new ArrayList<>();
        for (TextRectangle rectangle : locations)
        {
            borders.add((float)rectangle.getMaxY());
        }

        Rectangle pageSize = reader.getPageSize(page);
        borders.add(pageSize.getTop());
        borders.add(pageSize.getBottom());
        Collections.sort(borders, Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i=0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }
};

tool.split(new FileOutputStream(RESULT), new PdfReader(SOURCE));

(SplitPages.java test method testSplitDocumentAboveAngestellter)
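The second half of determineSplitPositions is plain list handling and can be tried without a PDF. A minimal sketch, assuming the top edges of the matches have already been extracted; SplitBordersSketch and its parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of how the split positions are assembled.
final class SplitBordersSketch {
    /** Returns page top, all match tops and page bottom, sorted from top to bottom. */
    static float[] borders(float pageTop, float pageBottom, float... matchTops) {
        List<Float> borders = new ArrayList<>();
        for (float y : matchTops)
            borders.add(y);
        borders.add(pageTop);
        borders.add(pageBottom);
        // PDF y coordinates grow upwards, so top to bottom means descending order
        Collections.sort(borders, Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i = 0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }
}
```

Each pair of neighbouring values in the returned array delimits one output page. A match sitting exactly at the page top yields two equal borders, which the tool's height check (height <= 0) skips, so no empty page is produced for it.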
