Extracting text from pdf (java using pdfbox library) from a table's rows with different heights


Problem description

    The black shapes are the text that needs to be extracted:

    So far, I've extracted the text from the columns, but manually, because there are only 5 of them (using the Rectangle class for the regions). My question is: is there a way to do the same for rows? The rectangles differ in size (height), and setting them up manually for 50+ rows would be an atrocity. More specifically, can I use a function to adjust the rectangle to every row's height? Or is there any other suggestion that may help?

    Solution

    As proposed in the comments, you can automatically recognize the table cell regions of your example PDF by parsing the vector graphics instructions of the page.

    For such a task you can extend the PDFBox PDFGraphicsStreamEngine which provides abstract methods called for path building and drawing instructions.

    Beware: The stream engine class I show here is specialized for recognizing table cell frame lines drawn as long, thin rectangles filled with black, as used in your example document. For a general solution you should at least also recognize frame lines drawn as vector graphics line segments or as stroked rectangles.
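One way to extend the approach to frame lines drawn as plain line segments would be to apply the same "fairly parallel to an axis" test to the end points of each segment. The following is only a minimal sketch under that assumption: `SegmentClassifier` and its `classify` method are hypothetical names, not part of PDFBox, and in a real stream engine the end points would still have to be collected from the `moveTo` and `lineTo` callbacks and evaluated in `strokePath`.

```java
import java.awt.geom.Point2D;

/** Hypothetical helper, not part of PDFBox: classifies a stroked line segment. */
final class SegmentClassifier {
    enum Orientation { HORIZONTAL, VERTICAL, ASKEW }

    /**
     * Applies the same factor-10 rule as the thin-rectangle check: a segment
     * counts as vertical if its x extent is less than a tenth of its y extent,
     * and as horizontal in the opposite case. Everything else is askew.
     */
    static Orientation classify(Point2D start, Point2D end) {
        double xDiff = Math.abs(end.getX() - start.getX());
        double yDiff = Math.abs(end.getY() - start.getY());
        if (xDiff * 10 < yDiff) return Orientation.VERTICAL;
        if (yDiff * 10 < xDiff) return Orientation.HORIZONTAL;
        return Orientation.ASKEW;
    }

    public static void main(String[] args) {
        // A nearly horizontal rule, a vertical rule, and a diagonal.
        System.out.println(classify(new Point2D.Double(50, 700), new Point2D.Double(550, 700.5)));
        System.out.println(classify(new Point2D.Double(50, 100), new Point2D.Double(50, 700)));
        System.out.println(classify(new Point2D.Double(0, 0), new Point2D.Double(100, 100)));
    }
}
```

Segments classified as horizontal would contribute their y range and vertical ones their x range, just like the thin rectangles do. A zero-length segment falls through both comparisons and is reported as askew, which conveniently drops it.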

    The stream engine class PdfBoxFinder

    This stream engine class collects the y coordinate ranges of horizontal lines and the x coordinate ranges of vertical lines and afterwards provides the boxes of the grid defined by these coordinate ranges. In particular this means that row spans or column spans are not supported; in the case at hand this is ok as there are no such spans.
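To make the grid idea concrete before the full class, here is a stripped-down sketch that works on plain line positions instead of coordinate ranges. The class name `GridSketch`, the method `cells`, and the sample coordinates are assumptions made for illustration only; the naming scheme (row letter from the top, column number from the left) mirrors the one used by `getBoxes`.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: builds named cell boxes from sorted line positions. */
final class GridSketch {
    /**
     * xs holds the ascending x positions of the vertical lines, ys the
     * ascending y positions of the horizontal lines (PDF coordinates, y up).
     * Rows are lettered starting at the top, columns are numbered from the left.
     */
    static Map<String, Rectangle2D> cells(double[] xs, double[] ys) {
        Map<String, Rectangle2D> result = new LinkedHashMap<>();
        char rowLetter = 'A';
        for (int i = ys.length - 1; i > 0; i--, rowLetter++) {
            double top = ys[i];
            double bottom = ys[i - 1];
            for (int j = 0; j + 1 < xs.length; j++) {
                String name = String.valueOf(rowLetter) + (j + 1);
                result.put(name, new Rectangle2D.Double(xs[j], bottom, xs[j + 1] - xs[j], top - bottom));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two columns and two rows of different heights (60 and 20 units).
        double[] xs = {50, 150, 400};
        double[] ys = {700, 720, 780};
        cells(xs, ys).forEach((name, box) -> System.out.println(name + " " + box));
    }
}
```

Because each row box is derived from two neighbouring horizontal lines, rows of different heights need no special handling. Note that the row name is a single `char` that is simply incremented, so after row 26 the names continue with the characters that follow `Z`; they remain unique but are no longer letters.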

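    import java.awt.geom.Point2D;
    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
    import org.apache.pdfbox.text.PDFTextStripperByArea; // referenced in Javadoc only
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class PdfBoxFinder extends PDFGraphicsStreamEngine {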
        /**
         * Supply the page to analyze here; to analyze multiple pages
         * create multiple {@link PdfBoxFinder} instances.
         */
        public PdfBoxFinder(PDPage page) {
            super(page);
        }
    
        /**
         * The boxes ({@link Rectangle2D} instances with coordinates according to
         * the PDF coordinate system, e.g. for decorating the table cells) the
         * {@link PdfBoxFinder} has recognized on the current page.
         */
        public Map<String, Rectangle2D> getBoxes() {
            consolidateLists();
            Map<String, Rectangle2D> result = new HashMap<>();
            if (!horizontalLines.isEmpty() && !verticalLines.isEmpty())
            {
                Interval top = horizontalLines.get(horizontalLines.size() - 1);
                char rowLetter = 'A';
                for (int i = horizontalLines.size() - 2; i >= 0; i--, rowLetter++) {
                    Interval bottom = horizontalLines.get(i);
                    Interval left = verticalLines.get(0);
                    int column = 1;
                    for (int j = 1; j < verticalLines.size(); j++, column++) {
                        Interval right = verticalLines.get(j);
                        String name = String.format("%s%s", rowLetter, column);
                        Rectangle2D rectangle = new Rectangle2D.Float(left.from, bottom.from, right.to - left.from, top.to - bottom.from);
                        result.put(name, rectangle);
                        left = right;
                    }
                    top = bottom;
                }
            }
            return result;
        }
    
        /**
         * The regions ({@link Rectangle2D} instances with coordinates according
         * to the PDFBox text extraction API, e.g. for initializing the regions of
         * a {@link PDFTextStripperByArea}) the {@link PdfBoxFinder} has recognized
         * on the current page.
         */
        public Map<String, Rectangle2D> getRegions() {
            PDRectangle cropBox = getPage().getCropBox();
            float xOffset = cropBox.getLowerLeftX();
            float yOffset = cropBox.getUpperRightY();
            Map<String, Rectangle2D> result = getBoxes();
            for (Map.Entry<String, Rectangle2D> entry : result.entrySet()) {
                Rectangle2D box = entry.getValue();
                Rectangle2D region = new Rectangle2D.Float(xOffset + (float)box.getX(), yOffset - (float)(box.getY() + box.getHeight()), (float)box.getWidth(), (float)box.getHeight());
                entry.setValue(region);
            }
            return result;
        }
    
        /**
         * <p>
         * Processes the path elements currently in the {@link #path} list and
         * eventually clears the list.
         * </p>
         * <p>
         * Currently only elements are considered which 
         * </p>
         * <ul>
         * <li>are {@link Rectangle} instances;
         * <li>are filled fairly black;
         * <li>have a thin and long form; and
         * <li>have sides fairly parallel to the coordinate axes.
         * </ul>
         */
        void processPath() throws IOException {
            PDColor color = getGraphicsState().getNonStrokingColor();
            if (!isBlack(color)) {
                logger.debug("Dropped path due to non-black fill-color.");
                return;
            }
    
            for (PathElement pathElement : path) {
                if (pathElement instanceof Rectangle) {
                    Rectangle rectangle = (Rectangle) pathElement;
    
                    double p0p1 = rectangle.p0.distance(rectangle.p1);
                    double p1p2 = rectangle.p1.distance(rectangle.p2);
                    boolean p0p1small = p0p1 < 3;
                    boolean p1p2small = p1p2 < 3;
    
                    if (p0p1small) {
                        if (p1p2small) {
                            logger.debug("Dropped rectangle too small on both sides.");
                        } else {
                            processThinRectangle(rectangle.p0, rectangle.p1, rectangle.p2, rectangle.p3);
                        }
                    } else if (p1p2small) {
                        processThinRectangle(rectangle.p1, rectangle.p2, rectangle.p3, rectangle.p0);
                    } else {
                        logger.debug("Dropped rectangle too large on both sides.");
                    }
                }
            }
            path.clear();
        }
    
        /**
         * The argument points shall be sorted to have (p0, p1) and (p2, p3) be the small
         * edges and (p1, p2) and (p3, p0) the long ones.
         */
        void processThinRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            float longXDiff = (float)Math.abs(p2.getX() - p1.getX());
            float longYDiff = (float)Math.abs(p2.getY() - p1.getY());
            boolean longXDiffSmall = longXDiff * 10 < longYDiff;
            boolean longYDiffSmall = longYDiff * 10 < longXDiff;
    
            if (longXDiffSmall) {
                verticalLines.add(new Interval(p0.getX(), p1.getX(), p2.getX(), p3.getX()));
            } else if (longYDiffSmall) {
                horizontalLines.add(new Interval(p0.getY(), p1.getY(), p2.getY(), p3.getY()));
            } else {
                logger.debug("Dropped rectangle too askew.");
            }
        }
    
        /**
         * Sorts the {@link #horizontalLines} and {@link #verticalLines} lists and
         * merges fairly identical entries.
         */
        void consolidateLists() {
            for (List<Interval> intervals : Arrays.asList(horizontalLines, verticalLines)) {
                intervals.sort(null);
                for (int i = 1; i < intervals.size();) {
                    if (intervals.get(i-1).combinableWith(intervals.get(i))) {
                        Interval interval = intervals.get(i-1).combineWith(intervals.get(i));
                        intervals.set(i-1, interval);
                        intervals.remove(i);
                    } else {
                        i++;
                    }
                }
            }
        }
    
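        /**
         * Checks whether the given color is blackish, i.e. whether each of
         * its red, green, and blue components is at most 5 (out of 255).
         */
        boolean isBlack(PDColor color) throws IOException {
            // toRGB() packs the color as 0xRRGGBB; inspect all three bytes.
            int value = color.toRGB();
            for (int i = 0; i < 3; i++) {
                int component = value & 0xff;
                if (component > 5)
                    return false;
                value /= 256;
            }
            return true;
        }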
    
        //
        // PDFGraphicsStreamEngine overrides
        //
        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
            path.add(new Rectangle(p0, p1, p2, p3));
        }
    
        @Override
        public void endPath() throws IOException {
            path.clear();
        }
    
        @Override
        public void strokePath() throws IOException {
            path.clear();
        }
    
        @Override
        public void fillPath(int windingRule) throws IOException {
            processPath();
        }
    
        @Override
        public void fillAndStrokePath(int windingRule) throws IOException {
            processPath();
        }
    
        @Override public void drawImage(PDImage pdImage) throws IOException { }
        @Override public void clip(int windingRule) throws IOException { }
        @Override public void moveTo(float x, float y) throws IOException { }
        @Override public void lineTo(float x, float y) throws IOException { }
        @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
        @Override public Point2D getCurrentPoint() throws IOException { return null; }
        @Override public void closePath() throws IOException { }
        @Override public void shadingFill(COSName shadingName) throws IOException { }
    
        //
        // inner classes
        //
        class Interval implements Comparable<Interval> {
            final float from;
            final float to;
    
            Interval(float... values) {
                Arrays.sort(values);
                this.from = values[0];
                this.to = values[values.length - 1];
            }
    
            Interval(double... values) {
                Arrays.sort(values);
                this.from = (float) values[0];
                this.to = (float) values[values.length - 1];
            }
    
            boolean combinableWith(Interval other) {
                if (this.from > other.from)
                    return other.combinableWith(this);
                if (this.to < other.from)
                    return false;
                float intersectionLength = Math.min(this.to, other.to) - other.from;
                float thisLength = this.to - this.from;
                float otherLength = other.to - other.from;
                return (intersectionLength >= thisLength * .9f) || (intersectionLength >= otherLength * .9f);
            }
    
            Interval combineWith(Interval other) {
                return new Interval(this.from, this.to, other.from, other.to);
            }
    
            @Override
            public int compareTo(Interval o) {
                return this.from == o.from ? Float.compare(this.to, o.to) : Float.compare(this.from, o.from);
            }
    
            @Override
            public String toString() {
                return String.format("[%3.2f, %3.2f]", from, to);
            }
        }
    
        interface PathElement {
        }
    
        class Rectangle implements PathElement {
            final Point2D p0, p1, p2, p3;
    
            Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
                this.p0 = p0;
                this.p1 = p1;
                this.p2 = p2;
                this.p3 = p3;
            }
        }
    
        //
        // members
        //
        final List<PathElement> path = new ArrayList<>();
        final List<Interval> horizontalLines = new ArrayList<>();
        final List<Interval> verticalLines = new ArrayList<>();
        final Logger logger = LoggerFactory.getLogger(PdfBoxFinder.class);
    }
    

    (PdfBoxFinder.java)
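The coordinate conversion in `getRegions` is easier to follow with concrete numbers. The sketch below repeats the same arithmetic outside PDFBox; the class name `RegionFlip`, the crop box values (lower left x of 0 and upper right y of 842, roughly an A4 page), and the sample box are assumptions made for illustration.

```java
import java.awt.geom.Rectangle2D;

/** Illustrative only: the box-to-region arithmetic of getRegions, without PDFBox. */
final class RegionFlip {
    /**
     * Converts a box given in PDF coordinates (origin at the bottom left, y
     * pointing up) into a region measured from the top edge of the crop box
     * with y pointing down, as expected when defining text extraction regions.
     */
    static Rectangle2D toRegion(Rectangle2D box, float cropLowerLeftX, float cropUpperRightY) {
        return new Rectangle2D.Float(
                cropLowerLeftX + (float) box.getX(),
                cropUpperRightY - (float) (box.getY() + box.getHeight()),
                (float) box.getWidth(),
                (float) box.getHeight());
    }

    public static void main(String[] args) {
        // A cell whose bottom edge is at y = 720 and whose top edge is at y = 780.
        Rectangle2D box = new Rectangle2D.Float(50, 720, 100, 60);
        Rectangle2D region = toRegion(box, 0f, 842f);
        // The top edge sits 842 - 780 = 62 units below the top of the page.
        System.out.println(region.getX() + " " + region.getY() + " "
                + region.getWidth() + " " + region.getHeight());
    }
}
```

Width and height are unchanged by the conversion; only the vertical anchor moves from the bottom edge of the box (measured upwards) to its top edge (measured downwards).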

    Example use

    You can use the PdfBoxFinder like this to extract text from the table cells of the sample document located at FILE_PATH:

    // PDFBox 2.x API; FILE_PATH is expected to be a java.io.File (an InputStream also works).
    try (PDDocument document = PDDocument.load(FILE_PATH)) {
        for (PDPage page : document.getDocumentCatalog().getPages()) {
            PdfBoxFinder boxFinder = new PdfBoxFinder(page);
            boxFinder.processPage(page);
    
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
            for (Map.Entry<String, Rectangle2D> entry : boxFinder.getRegions().entrySet()) {
                stripperByArea.addRegion(entry.getKey(), entry.getValue());
            }
    
            stripperByArea.extractRegions(page);
            List<String> names = stripperByArea.getRegions();
            names.sort(null);
            for (String name : names) {
                System.out.printf("[%s] %s\n", name, stripperByArea.getTextForRegion(name));
            }
        }
    }
    

    (ExtractBoxedText test testExtractBoxedTexts)

    The start of the output:

    [A1] Nr. 
    crt. 
    
    [A2] Nume şi prenume 
    
    [A3] Titlul lucrării 
    
    [A4] Coordonator ştiinţific 
    
    [A5] Ora 
    
    [B1] 1. 
    
    [B2] SFETCU I. JESSICA-
    LARISA 
    
    [B3] Analiza fluxurilor de date twitter 
    
    [B4] Conf. univ. dr. Frîncu Marc 
    Eduard 
     
    
    [B5] 8:00 
    
    [C1] 2. 
    
    [C2] TARBA V. IONUȚ-
    ADRIAN 
    
    [C3] Test me - rest api folosind java şi 
    play framework 
    
    [C4] Conf.univ.dr. Fortiş Teodor 
    Florin 
     
    
    [C5] 8:12 
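Since the question is about rows, the per-cell texts can be regrouped by the row character at the start of each region name. This is a minimal sketch, assuming names of the `A1`, `B3` form produced above and using a plain map in place of `PDFTextStripperByArea`; the class `RowGrouper`, its method `byRow`, and the sample strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: regroups cell texts keyed by names like "A1" into rows. */
final class RowGrouper {
    /**
     * Groups the cell texts by the first character of the region name. Within
     * a row the cells are ordered by their numeric column, and line breaks
     * inside a cell are collapsed into single spaces.
     */
    static Map<Character, List<String>> byRow(Map<String, String> cellTexts) {
        List<String> names = new ArrayList<>(cellTexts.keySet());
        // Compare the column part numerically so that "A10" follows "A9".
        names.sort((a, b) -> a.charAt(0) != b.charAt(0)
                ? Character.compare(a.charAt(0), b.charAt(0))
                : Integer.compare(Integer.parseInt(a.substring(1)), Integer.parseInt(b.substring(1))));
        Map<Character, List<String>> rows = new TreeMap<>();
        for (String name : names) {
            String text = cellTexts.get(name).trim().replaceAll("\\s+", " ");
            rows.computeIfAbsent(name.charAt(0), key -> new ArrayList<>()).add(text);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> cells = Map.of(
                "B1", "1. \n", "B5", "8:00 \n",
                "A1", "Nr. \ncrt. \n", "A5", "Ora \n");
        byRow(cells).forEach((row, texts) -> System.out.println(row + ": " + String.join(" | ", texts)));
    }
}
```

With real data the input map could be filled from `getTextForRegion` for every region name. The numeric column comparison only matters for tables with ten or more columns; with the five columns of the sample document a plain lexicographic sort gives the same order.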
    

    The first page of the document:
