Text coordinates when stripping from PDFBox
Problem description
I'm trying to extract text with coordinates from a PDF file using PDFBox.
I combined some methods and information found on the internet (including Stack Overflow), but the coordinates I get don't seem to be right. For example, when I use the coordinates to draw a rectangle on top of the text, the rectangle is painted elsewhere.
This is my code (please don't judge the style; it was written very quickly just to test):
TextLine.java
import java.util.List;
import org.apache.pdfbox.text.TextPosition;

/**
 *
 * @author samue
 */
public class TextLine {
    public List<TextPosition> textPositions = null;
    public String text = "";
}
myStripper.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/**
 *
 * @author samue
 */
public class myStripper extends PDFTextStripper {
    public myStripper() throws IOException
    {
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    public String getText(PDDocument doc) throws IOException
    {
        lines = new ArrayList<TextLine>();
        return super.getText(doc);
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        TextLine tmpline = null;
        tmpline = lines.get(lines.size() - 1);
        tmpline.text += getWordSeparator();
        super.writeWordSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        TextLine tmpline = null;
        if (startOfLine) {
            tmpline = new TextLine();
            tmpline.text = text;
            tmpline.textPositions = textPositions;
            lines.add(tmpline);
        } else {
            tmpline = lines.get(lines.size() - 1);
            tmpline.text += text;
            tmpline.textPositions.addAll(textPositions);
        }
        if (startOfLine)
        {
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }

    boolean startOfLine = true;
    public ArrayList<TextLine> lines = null;
}
Click event handler of the AWT button
private void jButton1MouseClicked(java.awt.event.MouseEvent evt) {
    // TODO add your handling code here:
    try {
        File file = new File("C:\\Users\\samue\\Desktop\\mwb_I_201711.pdf");
        PDDocument doc = PDDocument.load(file);
        myStripper stripper = new myStripper();
        stripper.setStartPage(1); // fix it to first page just to test it
        stripper.setEndPage(1);
        stripper.getText(doc);
        TextLine line = stripper.lines.get(1); // the line i want to paint on
        float minx = -1;
        float maxx = -1;
        for (TextPosition pos: line.textPositions)
        {
            if (pos == null)
                continue;
            if (minx == -1 || pos.getTextMatrix().getTranslateX() < minx) {
                minx = pos.getTextMatrix().getTranslateX();
            }
            if (maxx == -1 || pos.getTextMatrix().getTranslateX() > maxx) {
                maxx = pos.getTextMatrix().getTranslateX();
            }
        }
        TextPosition firstPosition = line.textPositions.get(0);
        TextPosition lastPosition = line.textPositions.get(line.textPositions.size() - 1);
        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();
        float w = (maxx - minx) + lastPosition.getWidth();
        float h = lastPosition.getHeightDir();
        PDPageContentStream contentStream = new PDPageContentStream(doc, doc.getPage(0), PDPageContentStream.AppendMode.APPEND, false);
        contentStream.setNonStrokingColor(Color.RED);
        contentStream.addRect(x, y, w, h);
        contentStream.fill();
        contentStream.close();
        File fileout = new File("C:\\Users\\samue\\Desktop\\pdfbox.pdf");
        doc.save(fileout);
        doc.close();
    } catch (Exception ex) {
    }
}
Any suggestions? What am I doing wrong?
Recommended answer
This is just another case of the excessive PDFTextStripper coordinate normalization. Just like you, I had thought that by using TextPosition.getTextMatrix() (instead of getX() and getY()) one would get the actual coordinates. But no, even these matrix values have to be corrected (at least in PDFBox 2.0.x; I haven't checked 1.8.x), because the matrix is multiplied by a translation that makes the lower left corner of the crop box the origin.
Thus, in your case (in which the lower left corner of the crop box is not the origin), you have to correct the values, e.g. by replacing
float x = minx;
float y = firstPosition.getTextMatrix().getTranslateY();
with
PDRectangle cropBox = doc.getPage(0).getCropBox();
float x = minx + cropBox.getLowerLeftX();
float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();
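With this correction, the rectangle is drawn at the position of the text line instead of elsewhere on the page.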
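The arithmetic behind this correction can be sketched without PDFBox. In the minimal sketch below, GlyphBox is a hypothetical stand-in (not a PDFBox class) for the values read from each TextPosition via getTextMatrix().getTranslateX(), getTextMatrix().getTranslateY() and getWidth(), and the two crop box parameters stand for PDRectangle.getLowerLeftX() and getLowerLeftY(). Unlike the code in the question, it tracks the right edge of every glyph rather than using -1 as a "not set" marker; that is a choice made for this illustration, not part of the answer.

```java
import java.util.Arrays;
import java.util.List;

final class CropBoxOffset {

    // Hypothetical stand-in for the normalized values taken from a TextPosition.
    static final class GlyphBox {
        final float x;      // getTextMatrix().getTranslateX()
        final float y;      // getTextMatrix().getTranslateY()
        final float width;  // getWidth()

        GlyphBox(float x, float y, float width) {
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Returns {x, y, width} of a text line in page coordinates, i.e. with the
     * crop box translation added back to the normalized origin.
     */
    static float[] lineInPageSpace(List<GlyphBox> glyphs,
                                   float cropLowerLeftX,
                                   float cropLowerLeftY) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("line has no glyphs");
        }
        float minX = Float.MAX_VALUE;
        float maxRight = -Float.MAX_VALUE;
        for (GlyphBox g : glyphs) {
            minX = Math.min(minX, g.x);
            maxRight = Math.max(maxRight, g.x + g.width);
        }
        // The baseline of the first glyph is used for the whole line,
        // as in the question's code.
        float y = glyphs.get(0).y;
        return new float[] {
            minX + cropLowerLeftX,   // undo the x translation
            y + cropLowerLeftY,      // undo the y translation
            maxRight - minX          // the width is unaffected by the translation
        };
    }

    public static void main(String[] args) {
        List<GlyphBox> line = Arrays.asList(
                new GlyphBox(10f, 700f, 5f),
                new GlyphBox(15f, 700f, 6f));
        float[] box = lineInPageSpace(line, 20f, 30f);
        System.out.println(Arrays.toString(box));
    }
}
```

Only the origin moves; widths and heights are differences of coordinates, so the translation cancels out of them.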
Obviously, though, you will also have to correct the height somewhat. This is due to the way PDFTextStripper determines the text height:
// 1/2 the bbox is used as the height todo: why?
float glyphHeight = bbox.getHeight() / 2;
(from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PDFTextStripper)
While the font bounding box is indeed usually too large, half of it is often not enough.
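To see why halving the bounding box can undershoot, here is a small arithmetic sketch. The metric values are illustrative and not taken from a real font, the class and method names are hypothetical, and the scaling assumes 1000 glyph units per em while ignoring any further scaling by the text matrix.

```java
final class GlyphHeightEstimate {

    // Half of the font bounding box height, in glyph units,
    // mirroring "bbox.getHeight() / 2".
    static float halfBBoxHeight(float bboxLowerY, float bboxUpperY) {
        return (bboxUpperY - bboxLowerY) / 2f;
    }

    // Converts glyph units to user space units at the given font size,
    // assuming 1000 glyph units per em.
    static float scaled(float glyphUnits, float fontSize) {
        return glyphUnits / 1000f * fontSize;
    }

    public static void main(String[] args) {
        // Illustrative metrics: bounding box from -200 to 900, ascent of 700.
        float half = halfBBoxHeight(-200f, 900f);
        float ascent = 700f;
        float fontSize = 10f;
        System.out.println("half bbox height: " + scaled(half, fontSize));
        System.out.println("ascent height:    " + scaled(ascent, fontSize));
    }
}
```

With these assumed numbers the halved bounding box is shorter than the ascent, so a rectangle of that height would not cover the tops of tall glyphs.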