从PDFBox剥离时的文本坐标 [英] Text coordinates when stripping from PDFBox

查看:94
本文介绍了从PDFBox剥离时的文本坐标的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用PDFBox从pdf文件中提取带有坐标的文本.

i'm trying to extract text with coordinates from a pdf file using PDFBox.

我混合了一些在Internet上找到的方法/信息(也有stackoverflow),但是我的坐标问题似乎并不正确.例如,当我尝试使用坐标在tex顶部绘制矩形时,将rect绘制在其他位置.

I mixed some methods/info found on internet (stackoverflow too), but the problem i have the coordinates doesnt'seems to be right. When i try to use coordinates for drawing a rectangle on top of tex, for example, the rect is painted elsewhere.

这是我的代码(请不要判断样式,只是为了测试而编写的非常快)

This is my code (please don't judge the style, was written very fast just to test)

TextLine.java

    import java.util.List;
    import org.apache.pdfbox.text.TextPosition;

    /**
     *
     * @author samue
     */
    public class TextLine {
        public List<TextPosition> textPositions = null;
        public String text = "";
    }

myStripper.java

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */

    /**
     *
     * @author samue
     */
    public class myStripper extends PDFTextStripper {
        public myStripper() throws IOException
        {
        }

        @Override
        protected void startPage(PDPage page) throws IOException
        {
            startOfLine = true;
            super.startPage(page);
        }

        @Override
        protected void writeLineSeparator() throws IOException
        {
            startOfLine = true;
            super.writeLineSeparator();
        }

        @Override
        public String getText(PDDocument doc) throws IOException
        {
            lines = new ArrayList<TextLine>();
            return super.getText(doc);
        }

        @Override
        protected void writeWordSeparator() throws IOException
        {
            TextLine tmpline = null;

            tmpline = lines.get(lines.size() - 1);
            tmpline.text += getWordSeparator();

            super.writeWordSeparator();
        }


        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextLine tmpline = null;

            if (startOfLine) {
                tmpline = new TextLine();
                tmpline.text = text;
                tmpline.textPositions = textPositions;
                lines.add(tmpline);
            } else {
                tmpline = lines.get(lines.size() - 1);
                tmpline.text += text;
                tmpline.textPositions.addAll(textPositions);
            }

            if (startOfLine)
            {
                startOfLine = false;
            }
            super.writeString(text, textPositions);
        }

        boolean startOfLine = true;
        public ArrayList<TextLine> lines = null;

    }

单击AWT按钮上的事件

 private void jButton1MouseClicked(java.awt.event.MouseEvent evt) {                                      
    // TODO add your handling code here:
    try {
        File file = new File("C:\\Users\\samue\\Desktop\\mwb_I_201711.pdf");
        PDDocument doc = PDDocument.load(file);

        myStripper stripper = new myStripper();

        stripper.setStartPage(1); // fix it to first page just to test it
        stripper.setEndPage(1);
        stripper.getText(doc);

        TextLine line = stripper.lines.get(1); // the line i want to paint on

        float minx = -1;
        float maxx = -1;

        for (TextPosition pos: line.textPositions)
        {
            if (pos == null)
                continue;

            if (minx == -1 || pos.getTextMatrix().getTranslateX() < minx) {
                minx = pos.getTextMatrix().getTranslateX();
            }
            if (maxx == -1 || pos.getTextMatrix().getTranslateX() > maxx) {
                maxx = pos.getTextMatrix().getTranslateX();
            }
        }

        TextPosition firstPosition = line.textPositions.get(0);
        TextPosition lastPosition = line.textPositions.get(line.textPositions.size() - 1);

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();
        float w = (maxx - minx) + lastPosition.getWidth();
        float h = lastPosition.getHeightDir();

        PDPageContentStream contentStream = new PDPageContentStream(doc, doc.getPage(0), PDPageContentStream.AppendMode.APPEND, false);

        contentStream.setNonStrokingColor(Color.RED);
        contentStream.addRect(x, y, w, h);
        contentStream.fill();
        contentStream.close();

        File fileout = new File("C:\\Users\\samue\\Desktop\\pdfbox.pdf");
        doc.save(fileout);
        doc.close();
    } catch (Exception ex) {

    }
}                                     

有什么建议吗?我在做什么错了?

any suggestion? what am i doing wrong?

推荐答案

这只是过度的PdfTextStripper坐标归一化的另一种情况.就像您曾经想过的那样,使用TextPosition.getTextMatrix()(而不是getX()getY)可以得到实际的坐标,但是没有,即使是这些矩阵值也必须进行校正(至少在PDFBox 2.0.x中,我没有检查1.8.x),因为矩阵乘以一个平移,使裁剪框的左下角成为原点.

This is just another case of the excessive PdfTextStripper coordinate normalization. Just like you I had thought that by using TextPosition.getTextMatrix() (instead of getX() and getY) one would get the actual coordinates, but no, even these matrix values have to be corrected (at least in PDFBox 2.0.x, I haven't checked 1.8.x) because the matrix is multiplied by a translation making the lower left corner of the crop box the origin.

因此,在您的情况下(其中裁剪框的左下角不是原点),您必须更正值,例如通过替换

Thus, in your case (in which the lower left of the crop box is not the origin), you have to correct the values, e.g. by replacing

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();

作者

        PDRectangle cropBox = doc.getPage(0).getCropBox();

        float x = minx + cropBox.getLowerLeftX();
        float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();

代替

您现在得到

但是,显然,您还必须对高度进行一些校正.这是由于PdfTextStripper确定文本高度的方式造成的:

Obviously, though, you will also have to correct the height somewhat. This is due to the way the PdfTextStripper determines the text height:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;

(来自LegacyPDFStreamEngine中的showGlyph(...),父类为PdfTextStripper)

(from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PdfTextStripper)

虽然字体边界框通常确实太大,但往往只有一半是不够的.

While the font bounding box indeed usually is too large, half of it often is not enough.

这篇关于从PDFBox剥离时的文本坐标的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆