How can I get image coordinates in a PDF into a JSON file?
Problem description
I have written code that builds an HTML page, including its images, from a page extracted from a PDF document.
I tried to extract the images from the PDF, and I succeeded in extracting them and placing them in the HTML page using the PDFBox library. However, I could not get the image coordinates for the HTML page.
So I searched for how to extract image coordinates from a PDF and tried to do it with the PDFBox library.
The code is as follows:
public static void main(String[] args) throws Exception
{
    try
    {
        PDDocument document = PDDocument.load(
            "/Users/tmdtjq/Downloads/PDFTest/test.pdf" );
        PrintImageLocations printer = new PrintImageLocations();
        List allPages = document.getDocumentCatalog().getAllPages();
        for( int i=0; i<allPages.size(); i++ )
        {
            PDPage page = (PDPage)allPages.get( i );
            int pageNum = i+1;
            System.out.println( "Processing page: " + pageNum );
            printer.processStream( page, page.findResources(),
                page.getContents().getStream() );
        }
    }
    finally
    {
    }
}
protected void processOperator( PDFOperator operator, List<COSBase> arguments ) throws IOException
{
    String operation = operator.getOperation();
    if( operation.equals( "Do" ) )
    {
        COSName objectName = (COSName)arguments.get( 0 );
        Map<String, PDXObject> xobjects = getResources().getXObjects();
        PDXObject xobject = xobjects.get( objectName.getName() );
        if( xobject instanceof PDXObjectImage )
        {
            try
            {
                PDXObjectImage image = (PDXObjectImage)xobject;
                PDPage page = getCurrentPage();
                Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
                double rotationInRadians = (page.findRotation() * Math.PI)/180;
                AffineTransform rotation = new AffineTransform();
                rotation.setToRotation( rotationInRadians );
                AffineTransform rotationInverse = rotation.createInverse();
                Matrix rotationInverseMatrix = new Matrix();
                rotationInverseMatrix.setFromAffineTransform( rotationInverse );
                Matrix rotationMatrix = new Matrix();
                rotationMatrix.setFromAffineTransform( rotation );
                Matrix unrotatedCTM = ctm.multiply( rotationInverseMatrix );
                float xScale = unrotatedCTM.getXScale();
                float yScale = unrotatedCTM.getYScale();
                float xPosition = unrotatedCTM.getXPosition();
                float yPosition = unrotatedCTM.getYPosition();
                System.out.println( "Found image[" + objectName.getName() + "] " +
                    "at " + xPosition + "," + yPosition +
                    " size=" + (xScale/100f*image.getWidth()) + "," + (yScale/100f*image.getHeight() ));
            }
            catch( NoninvertibleTransformException e )
            {
                throw new WrappedIOException( e );
            }
        }
    }
}
The X,Y positions printed for the images are all 0.0, 0.0.
I think this is because getGraphicsState() is just a method that returns the graphics state.
But I want the actual image coordinates, relative to the height and width of the PDF page, so that I can create the HTML page.
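For context on relating PDF coordinates to an HTML page: PDF user space normally has its origin at the bottom-left corner with y growing upward, whereas HTML/CSS positions elements from the top-left with y growing downward. The following is a minimal sketch of that conversion, assuming an unrotated page whose media box starts at (0, 0). The class name `PdfToHtml`, the method `toTopLeft`, and the `scale` parameter are hypothetical and not part of PDFBox.

```java
/** Hypothetical helper (not part of PDFBox) for mapping PDF boxes to HTML boxes. */
final class PdfToHtml {
    private PdfToHtml() {}

    /**
     * Converts a box given by its lower-left corner (x, y) in PDF user space
     * into {left, top, width, height} measured from the page's top-left corner.
     * Assumes an unrotated page whose media box origin is (0, 0).
     * scale is the number of HTML pixels per PDF unit (1.0 keeps PDF units).
     */
    static double[] toTopLeft(double x, double y, double width, double height,
                              double pageHeight, double scale) {
        double left = x * scale;
        // The top edge of the box sits at y + height in PDF space;
        // flip it against the page height to measure from the top.
        double top = (pageHeight - (y + height)) * scale;
        return new double[] {left, top, width * scale, height * scale};
    }
}
```

For example, on a page 792 units high, an image whose lower-left corner is at (50, 300) with size 200 x 100 would end up at left 50, top 392 when `scale` is 1.0.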
I think extracting the image coordinates from the PDF into JSON might be a solution.
Please recommend a tool that exports image coordinates in a PDF to JSON, or suggest a PDF library.
(I have already used the pdf2json tool from FlexPaper. It extracts a JSON file from a PDF page that contains only text data (content, coordinates, font, ...), not image data.)
Recommended answer
I was able to find the images by searching for the cm operator.
I overrode PDFTextStripper in the following way:
Note: it doesn't take rotation and mirroring into account!
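// Imports assume PDFBox 1.8.x package names.
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public static class TextFinder extends PDFTextStripper {
    public TextFinder() throws IOException {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException {
        // process start of the page
        super.startPage(page);
    }

    // The method to override is processOperator; the stream engine has no
    // process(PDFOperator, List) method, so @Override on that name fails to compile.
    @Override
    protected void processOperator(PDFOperator operator, List<COSBase> arguments)
            throws IOException {
        if ("cm".equals(operator.getOperation())) {
            // cm takes six operands a b c d e f. For an unrotated, unmirrored
            // image, a and d are the width and height, and e and f the position.
            float width = ((COSNumber)arguments.get(0)).floatValue();
            float height = ((COSNumber)arguments.get(3)).floatValue();
            float x = ((COSNumber)arguments.get(4)).floatValue();
            float y = ((COSNumber)arguments.get(5)).floatValue();
            // process image coordinates
        }
        super.processOperator(operator, arguments);
    }

    @Override
    protected void writeString(String text,
            List<TextPosition> textPositions) throws IOException {
        for (TextPosition position : textPositions) {
            // process text coordinates
        }
        super.writeString(text, textPositions);
    }
}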
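The rotation and mirroring caveat noted before the code could be handled by using all six matrix values rather than only four of them. In PDF, an image is painted into the unit square transformed by the current transformation matrix, so mapping the four corners of that square and taking their bounding box gives the image's extent on the page. The following is a minimal sketch using only `java.awt.geom`. The class `ImageBounds` and the method `fromCtm` are hypothetical names, and the sketch assumes the six values passed in are the effective matrix when the image is drawn, not just the operands of one of several nested cm operators.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of PDFBox). */
final class ImageBounds {
    private ImageBounds() {}

    /**
     * Returns the axis-aligned bounding box, in PDF user space, of the unit
     * square mapped through the matrix [a b c d e f].
     */
    static Rectangle2D.Double fromCtm(double a, double b, double c,
                                      double d, double e, double f) {
        // AffineTransform's six-argument constructor takes the values in the
        // same order as the PDF operands: x' = a*x + c*y + e, y' = b*x + d*y + f.
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);

        // Corners of the unit square as (x, y) pairs.
        double[] corners = {0, 0, 1, 0, 1, 1, 0, 1};
        ctm.transform(corners, 0, corners, 0, 4);

        double minX = corners[0], maxX = corners[0];
        double minY = corners[1], maxY = corners[1];
        for (int i = 2; i < corners.length; i += 2) {
            minX = Math.min(minX, corners[i]);
            maxX = Math.max(maxX, corners[i]);
            minY = Math.min(minY, corners[i + 1]);
            maxY = Math.max(maxY, corners[i + 1]);
        }
        return new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY);
    }
}
```

With an unrotated matrix such as `200 0 0 100 50 300` this gives the same values as reading the operands directly (position 50, 300 and size 200 x 100). With a mirrored matrix such as `-200 0 0 100 250 300`, the first operand alone would report a negative width, while the bounding box still comes out as position 50, 300 and size 200 x 100.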
Of course, one can use PDFStreamEngine instead of PDFTextStripper if one is not interested in finding text together with images.
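Since the question asks for the coordinates as JSON, the collected values still need to be serialized. The following is a minimal sketch using only the Java standard library. The `ImageJson` and `Placement` types and the field names in the output are hypothetical choices for illustration, not a format produced by PDFBox or pdf2json. The string escaping covers only quotes and backslashes, so a JSON library would be the safer choice for arbitrary image names.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical serializer for image placements collected while parsing. */
final class ImageJson {
    private ImageJson() {}

    /** Hypothetical value type: one image occurrence on one page. */
    static final class Placement {
        final int page;
        final String name;
        final double x, y, width, height;

        Placement(int page, String name, double x, double y,
                  double width, double height) {
            this.page = page;
            this.name = name;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Renders the placements as a JSON array of objects. */
    static String toJson(List<Placement> placements) {
        return placements.stream()
            // Locale.ROOT keeps the decimal separator a dot on every system.
            .map(p -> String.format(Locale.ROOT,
                "{\"page\":%d,\"name\":\"%s\",\"x\":%.2f,\"y\":%.2f,"
                    + "\"width\":%.2f,\"height\":%.2f}",
                p.page, escape(p.name), p.x, p.y, p.width, p.height))
            .collect(Collectors.joining(",", "[", "]"));
    }

    /** Minimal escaping: backslashes and double quotes only. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A `Placement` could be created inside the "process image coordinates" branch of the operator handler and added to a list, with `toJson` called once after all pages have been processed.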