How to find table border lines in a PDF using PDFBox?
Problem description
I am trying to find table border lines in a PDF. I used the PrintTextLocations class of PDFBox to assemble words. Now I am looking for the coordinates of the different lines that form the table. I tried using org.apache.pdfbox.pdfviewer.PageDrawer, but I am unable to find any character/graphic containing those lines. I tried two ways:
First:
Graphics g = null;
Dimension d = new Dimension();
d.setSize(700, 700);
PageDrawer pageDrawer = new PageDrawer();
pageDrawer.drawPage(g, myPage, d);
It gave me a null pointer exception. So secondly, I tried to override the processStream function, but I am unable to get any strokes. Kindly help me out. I am open to using any other library that gives me the coordinates of the lines in the table. Another quick question: what kind of objects are those table border lines in PDFBox? Are they graphics or are they characters?
Here is the link to the sample PDF I am trying to parse: http://stats.bls.gov/news.release/pdf/empsit.pdf. I am trying to get the table lines on page 8.
Edit: I faced another problem. While parsing page 1 of this PDF, I am unable to get any lines because the pathIterator in the printPath() function is empty, although the strokePath() function is called for each line. How can I work with this PDF?
In the 1.8.* versions, PDFBox parsing capabilities were implemented in a not very generic way. In particular, the OperatorProcessor implementations were tightly coupled to specific parser classes; e.g. the implementations dealing with path drawing operations assumed they were interacting with a PageDrawer instance.
Thus, unless one wanted to copy & paste all those OperatorProcessor classes with minute changes, one had to derive from such a specific parser class.
In your case, therefore, we will also derive our parser from PageDrawer; after all, we are interested in path drawing operations:
// imports for PDFBox 1.8.x
import java.awt.Image;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdfviewer.PageDrawer;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.TextPosition;

public class PrintPaths extends PageDrawer
{
    //
    // constructor
    //
    public PrintPaths() throws IOException
    {
        super();
    }

    //
    // method overrides for mere path observation
    //
    // ignore text
    @Override
    protected void processTextPosition(TextPosition text) { }

    // ignore bitmaps
    @Override
    public void drawImage(Image awtImage, AffineTransform at) { }

    // ignore shadings
    @Override
    public void shFill(COSName shadingName) throws IOException { }

    // set the page size here because drawPage(), which normally does so, is never called
    @Override
    public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
    {
        PDRectangle cropBox = aPage.findCropBox();
        this.pageSize = cropBox.createDimension();
        super.processStream(aPage, resources, cosStream);
    }

    // print the current path instead of filling it, then discard it
    @Override
    public void fillPath(int windingRule) throws IOException
    {
        printPath();
        System.out.printf("Fill; windingrule: %s\n\n", windingRule);
        getLinePath().reset();
    }

    // print the current path instead of stroking it, then discard it
    @Override
    public void strokePath() throws IOException
    {
        printPath();
        System.out.printf("Stroke; unscaled width: %s\n\n", getGraphicsState().getLineWidth());
        getLinePath().reset();
    }

    // walk the segments of the current path and print each one
    void printPath()
    {
        GeneralPath path = getLinePath();
        PathIterator pathIterator = path.getPathIterator(null);
        double x = 0, y = 0; // end point of the previous segment
        double coords[] = new double[6];
        while (!pathIterator.isDone()) {
            switch (pathIterator.currentSegment(coords)) {
            case PathIterator.SEG_MOVETO:
                System.out.printf("Move to (%s %s)\n", coords[0], fixY(coords[1]));
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_LINETO:
                double width = getEffectiveWidth(coords[0] - x, coords[1] - y);
                System.out.printf("Line to (%s %s), scaled width %s\n", coords[0], fixY(coords[1]), width);
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_QUADTO:
                System.out.printf("Quad along (%s %s) and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]));
                x = coords[2];
                y = coords[3];
                break;
            case PathIterator.SEG_CUBICTO:
                System.out.printf("Cubic along (%s %s), (%s %s), and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]), coords[4], fixY(coords[5]));
                x = coords[4];
                y = coords[5];
                break;
            case PathIterator.SEG_CLOSE:
                System.out.println("Close path");
            }
            pathIterator.next();
        }
    }

    // scale the line width by how much the CTM stretches a vector perpendicular to the line direction
    double getEffectiveWidth(double dirX, double dirY)
    {
        if (dirX == 0 && dirY == 0)
            return 0;
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        double widthX = dirY;
        double widthY = -dirX;
        double widthXTransformed = widthX * ctm.getValue(0, 0) + widthY * ctm.getValue(1, 0);
        double widthYTransformed = widthX * ctm.getValue(0, 1) + widthY * ctm.getValue(1, 1);
        double factor = Math.sqrt((widthXTransformed*widthXTransformed + widthYTransformed*widthYTransformed) / (widthX*widthX + widthY*widthY));
        return getGraphicsState().getLineWidth() * factor;
    }
}
As we do not want to actually draw the page but merely extract the paths that would be drawn, we have to strip down the PageDrawer like this.
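The getEffectiveWidth helper in the class scales the line width by the factor by which the current transformation matrix stretches a vector perpendicular to the line. A standalone sketch of that idea using only java.awt.geom might look as follows; the class name EffectiveWidthSketch is hypothetical, and the uniform 0.1 scale in main is an assumption inferred from the ratio between the unscaled and scaled widths in the sample output, not something read from the PDF itself.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class EffectiveWidthSketch {

    /**
     * Scales lineWidth by the factor by which ctm stretches a vector
     * perpendicular to the direction (dirX, dirY).
     */
    static double effectiveWidth(double lineWidth, double dirX, double dirY, AffineTransform ctm) {
        if (dirX == 0 && dirY == 0) {
            return 0; // degenerate segment without a direction
        }
        // rotate the direction by 90 degrees to get the direction of the width
        Point2D perpendicular = new Point2D.Double(dirY, -dirX);
        // deltaTransform applies only the linear part of the matrix (no translation)
        Point2D transformed = ctm.deltaTransform(perpendicular, null);
        double factor = Math.hypot(transformed.getX(), transformed.getY())
                / Math.hypot(perpendicular.getX(), perpendicular.getY());
        return lineWidth * factor;
    }

    public static void main(String[] args) {
        // assumed CTM: uniform scale by 0.1
        AffineTransform ctm = AffineTransform.getScaleInstance(0.1, 0.1);
        // a horizontal direction with the unscaled width from the sample output
        System.out.println(effectiveWidth(5.981, 538.8, 0, ctm));
    }
}
```

With a uniform scale the direction does not matter; it only starts to matter once the matrix scales x and y differently or shears.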
This sample parser prints path drawing operations to show how to do it. Obviously, you can instead collect them for automated processing...
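As a rough illustration of collecting instead of printing, the following sketch walks a GeneralPath in the same way as printPath() but stores every straight segment as a Line2D. It uses only java.awt.geom, so it runs without PDFBox; the class and method names (PathSegmentCollector, collectLines) are made up for this example, and curve segments are simply skipped on the assumption that table borders consist of straight lines.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

public class PathSegmentCollector {

    /** Returns the straight segments of the path; curves only move the current point. */
    static List<Line2D> collectLines(GeneralPath path) {
        List<Line2D> lines = new ArrayList<>();
        double[] coords = new double[6];
        double startX = 0, startY = 0; // start of the current subpath
        double x = 0, y = 0;           // current point
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(coords)) {
                case PathIterator.SEG_MOVETO:
                    x = startX = coords[0];
                    y = startY = coords[1];
                    break;
                case PathIterator.SEG_LINETO:
                    lines.add(new Line2D.Double(x, y, coords[0], coords[1]));
                    x = coords[0];
                    y = coords[1];
                    break;
                case PathIterator.SEG_QUADTO:
                    x = coords[2];
                    y = coords[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    x = coords[4];
                    y = coords[5];
                    break;
                case PathIterator.SEG_CLOSE:
                    // closing a subpath implicitly draws a line back to its start
                    if (x != startX || y != startY) {
                        lines.add(new Line2D.Double(x, y, startX, startY));
                    }
                    x = startX;
                    y = startY;
                    break;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        GeneralPath path = new GeneralPath();
        path.moveTo(35.92, 724.65);
        path.lineTo(574.73, 724.65);
        for (Line2D line : collectLines(path)) {
            System.out.println(line.getP1() + " -> " + line.getP2());
        }
    }
}
```

In a PageDrawer subclass, strokePath() and fillPath() could call such a method with getLinePath() before resetting the path, and keep the resulting list together with the line width for later processing.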
You can use the PrintPaths parser like this:
// resource: the PDF to parse, e.g. an InputStream or File
PDDocument document = PDDocument.load(resource);
List<?> allPages = document.getDocumentCatalog().getAllPages();
int i = 7; // page 8
System.out.println("\n\nPage " + (i+1));
PrintPaths printPaths = new PrintPaths();
PDPage page = (PDPage) allPages.get(i);
PDStream contents = page.getContents();
if (contents != null)
{
    printPaths.processStream(page, page.findResources(), contents.getStream());
}
The output is:
Page 8
Move to (35.92070007324219 724.6490478515625)
Line to (574.72998046875 724.6490478515625), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (35.92070007324219 694.4660034179688)
Line to (574.72998046875 694.4660034179688), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (292.2610168457031 468.677001953125)
Line to (292.8590087890625 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (348.9360046386719 468.677001953125)
Line to (349.53399658203125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (405.6090087890625 468.677001953125)
Line to (406.2070007324219 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (462.281982421875 468.677001953125)
Line to (462.8799743652344 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (518.9549560546875 468.677001953125)
Line to (519.553955078125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (35.92070007324219 725.447998046875)
Line to (574.72998046875 725.447998046875), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (35.92070007324219 212.5050048828125)
Line to (574.72998046875 212.5050048828125), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Quite peculiar: the vertical lines are actually drawn as very short (ca. 0.6 units), very thick (ca. 513 units) horizontal lines...
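For automated processing it can therefore help to look at the area a stroke covers rather than at the segment alone. The sketch below turns an axis-parallel segment plus its scaled width into a rectangle and checks whether that rectangle is taller than wide. The class and method names are hypothetical, and butt line caps are an assumption, since the output above does not show the cap style.

```java
import java.awt.geom.Rectangle2D;

public class ThickLineSketch {

    /** Area covered by an axis-parallel stroke, assuming butt line caps. */
    static Rectangle2D strokedArea(double x1, double y1, double x2, double y2, double width) {
        double half = width / 2;
        if (y1 == y2) { // horizontal segment: the width extends vertically
            return new Rectangle2D.Double(Math.min(x1, x2), y1 - half, Math.abs(x2 - x1), width);
        }
        if (x1 == x2) { // vertical segment: the width extends horizontally
            return new Rectangle2D.Double(x1 - half, Math.min(y1, y2), width, Math.abs(y2 - y1));
        }
        throw new IllegalArgumentException("segment is not axis-parallel");
    }

    /** A covered area that is taller than wide behaves like a vertical rule. */
    static boolean looksVertical(Rectangle2D area) {
        return area.getHeight() > area.getWidth();
    }

    public static void main(String[] args) {
        // first "vertical" line of the sample output, values rounded
        Rectangle2D area = strokedArea(292.261, 468.677, 292.859, 468.677, 512.943);
        System.out.println("x: " + area.getMinX() + " .. " + area.getMaxX());
        System.out.println("y: " + area.getMinY() + " .. " + area.getMaxY());
        System.out.println("vertical: " + looksVertical(area));
    }
}
```

For that sample line the rectangle spans roughly y = 212.2 to y = 725.1, which lines up with the horizontal rules at about 212.5 and 725.4 in the output above.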