PDFBox: extract image location (wrong x and y)


Question

Greetings again, fellow programmers.

I can extract PDF text coordinates and formatting properly, but I can't do the same for images. I get the proper width and height, but the x and y are wrong.

I'm using Photoshop to check whether I'm getting the proper x, y, width, and height values, but only the width and height are correct.

Here is my code:

@Override
public void processOperator(Operator operator, List<COSBase> arguments) throws IOException {
    if ("cm".equals(operator.getName())) {
        float width = ((COSNumber)arguments.get(0)).floatValue();
        float height = ((COSNumber)arguments.get(3)).floatValue();
        float x = ((COSNumber)arguments.get(4)).floatValue();
        float y = ((COSNumber)arguments.get(5)).floatValue();
        System.out.println("w: " + width + " h: " + height + " x: " + x + " y: " + y);
        // process image coordinates
    }

    super.processOperator(operator, arguments);
}

And here is the example PDF I used:

http://persci.mit.edu/pub_pdfs/personal_photo_enhancement.pdf

I'm using page 2.

This is the program's output:

w: 503.87997 h: 152.64 x: 71.5168 y: 561.056

I drew a rectangle in Photoshop and overlaid it on the image, but only the width and height match.

I also used this PDF:

http://www.ctex.org/documents/shredder/src/example.pdf

I used page 17.

Why does the output show so many coordinates for this PDF when the page contains only one image?

w: 1.0 h: 1.0 x: 124.802 y: 776.998
w: 1.0 h: 1.0 x: 0.0 y: 3.587
w: 1.0 h: 1.0 x: 0.0 y: -3.985
w: 1.0 h: 1.0 x: 343.711 y: 0.398
w: 1.0 h: 1.0 x: -343.711 y: -24.906
w: 1.0 h: 1.0 x: 147.972 y: -106.0
w: 1.0 h: 1.0 x: 0.0 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 0.0
w: 0.1 h: 0.1 x: 0.0 y: 0.0
w: 1.0 h: 1.0 x: 45.0 y: 0.0
w: 1.0 h: 1.0 x: -79.37 y: -21.918
w: 1.0 h: 1.0 x: 116.507 y: 0.0
w: 1.0 h: 1.0 x: -230.109 y: -2.145
w: 1.0 h: 1.0 x: 0.0 y: -20.324
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 179.886 y: -66.21
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -215.552 y: -17.195
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: -35.666 y: -76.173
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -4.981 y: -41.843
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -4.981 y: -51.806
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: 175.592 y: -19.925
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -185.554 y: -19.925
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: -37.121
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 282.916 y: -18.389
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -318.582 y: -17.196
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 11.988 y: -11.216
w: 1.0 h: 1.0 x: 0.0 y: -14.833
w: 1.0 h: 1.0 x: 3.388 y: 4.926
w: 1.0 h: 1.0 x: 60.357 y: -4.926
w: 1.0 h: 1.0 x: -63.745 y: -0.399
w: 1.0 h: 1.0 x: 63.944 y: -3.985
w: 1.0 h: 1.0 x: -59.959 y: 0.0
w: 1.0 h: 1.0 x: 64.143 y: 0.0
w: 1.0 h: 1.0 x: -110.801 y: -13.101
w: 1.0 h: 1.0 x: 0.0 y: -2.241
w: 1.0 h: 1.0 x: 39.308 y: 2.241
w: 1.0 h: 1.0 x: 0.0 y: -2.241
w: 1.0 h: 1.0 x: -37.066 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 13.294
w: 1.0 h: 1.0 x: 1.145 y: -9.907
w: 1.0 h: 1.0 x: 39.641 y: 11.302
w: 1.0 h: 1.0 x: 0.0 y: -15.686
w: 1.0 h: 1.0 x: 1.693 y: 14.291
w: 1.0 h: 1.0 x: 0.0 y: -12.896
w: 1.0 h: 1.0 x: 3.288 y: 2.989
w: 1.0 h: 1.0 x: 47.544 y: -2.989
w: 1.0 h: 1.0 x: -50.832 y: -0.299
w: 1.0 h: 1.0 x: 52.227 y: -1.096
w: 1.0 h: 1.0 x: -53.92 y: -0.597
w: 1.0 h: 1.0 x: 57.838 y: 14.888
w: 1.0 h: 1.0 x: 0.0 y: -11.22
w: 1.0 h: 1.0 x: 0.0 y: -2.473
w: 1.0 h: 1.0 x: 42.751 y: 2.473
w: 1.0 h: 1.0 x: 0.0 y: -2.473
w: 1.0 h: 1.0 x: -40.278 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 13.693
w: 1.0 h: 1.0 x: 1.313 y: -9.907
w: 1.0 h: 1.0 x: -104.652 y: -78.762
w: 1.0 h: 1.0 x: 166.874 y: 0.0
w: 1.0 h: 1.0 x: 176.837 y: 0.0

Recommended answer

The cause of the problems

Your code does not really look for image positions and sizes; it merely happens to find them under friendly circumstances.

Your post shows only a single method without explicit context (which, I presume, is why no one seriously analyzed the code and spotted the issue).

Considering the context (PDFBox, content stream analysis), though, I assume that you created an operator processor class in which you overrode the processOperator method as in the posted code. Furthermore, I assume that you registered your operator processor for the cm instruction with some PDF stream engine and ran that against your sample PDFs.

Given these assumptions, it is pretty clear why the output from your operator processor only sometimes contains the image size and position and often contains many unrelated data sets:

The cm instruction merely changes the current transformation matrix; it is not directly or exclusively related to drawing bitmap images!

According to the PDF specification:

Operands        Operator  Description
a b c d e f     cm        Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(Table 57 – Graphics state operators – ISO 32000-1)

The only reason the cm parameters occasionally do contain image size and position information is that the bitmap drawing operators draw an image into a 1x1 area (in user space units) whose lower left corner is the origin. To stretch and move the coordinate system so that this area ends up with the desired image size and position on the page, PDF processors modify the current transformation matrix accordingly with a cm instruction before drawing the image, often right before.

If they do so in one step (as quoted above, cm concatenates the specified matrix to the CTM; it does not replace it) and don't use rotations or similar niceties, then a and d (the first and fourth cm parameters) indeed contain the size of the image on the page (in default user space units), and e and f (the fifth and sixth cm parameters) contain the coordinates of its lower left corner.
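To make the role of the six cm operands concrete, here is a minimal sketch that maps the 1x1 image square through a single matrix and reports the resulting box. It uses only java.awt.geom from the JDK, not PDFBox; the class and method names (UnitSquareMapping, imageBox) are made up for illustration, and it assumes the matrix passed in is the complete CTM at the time the image is drawn.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class UnitSquareMapping {

    /**
     * Maps the unit square (0,0)-(1,1) through the PDF matrix [a b c d e f]
     * and returns its bounding box in the target coordinate system.
     * PDF defines x' = a*x + c*y + e and y' = b*x + d*y + f, which matches
     * the argument order of this AffineTransform constructor.
     */
    static Rectangle2D imageBox(double a, double b, double c, double d, double e, double f) {
        AffineTransform matrix = new AffineTransform(a, b, c, d, e, f);
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return matrix.createTransformedShape(unitSquare).getBounds2D();
    }

    public static void main(String[] args) {
        // The values printed for page 2 in the question: no rotation,
        // so a and d are the size and e and f are the lower left corner.
        System.out.println(imageBox(503.87997, 0, 0, 152.64, 71.5168, 561.056));

        // A hypothetical image rotated by 90 degrees: a and d are both 0,
        // yet the image still covers a 50 x 100 box on the page.
        System.out.println(imageBox(0, 100, -50, 0, 200, 300));
    }
}
```

The second call shows why reading only a and d is fragile: once b and c are non-zero, the size has to be derived from the transformed corners instead.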

Thus, instead of merely looking at the cm parameters, one has to

  • parse the content stream in question,
  • calculate the concatenation of all matrices applied to the CTM (also keeping track of the effects of intermediary q and Q instructions), and
  • retrieve the values of the current transformation matrix when the Do instruction for a bitmap image resource occurs.
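
The three steps above can be sketched with a toy interpreter. This is not PDFBox code and not a real content stream parser: it assumes a stream that can be split on whitespace, that contains only q, Q, cm and Do, and in which every Do draws a bitmap image. The class name CtmTracker and the sample stream are made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class CtmTracker {

    /** Returns the bounding box of the unit square under the CTM at each Do. */
    static List<Rectangle2D> imageBoxes(String contentStream) {
        AffineTransform ctm = new AffineTransform();         // starts as identity
        Deque<AffineTransform> saved = new ArrayDeque<>();   // graphics state stack
        List<String> operands = new ArrayList<>();
        List<Rectangle2D> boxes = new ArrayList<>();

        for (String token : contentStream.trim().split("\\s+")) {
            switch (token) {
                case "q":   // save the current matrix
                    saved.push(new AffineTransform(ctm));
                    break;
                case "Q":   // restore the most recently saved matrix
                    if (!saved.isEmpty()) {
                        ctm = saved.pop();
                    }
                    break;
                case "cm": { // concatenate: the new matrix is applied before the old CTM
                    double[] m = new double[6];
                    for (int i = 0; i < 6; i++) {
                        m[i] = Double.parseDouble(operands.get(i));
                    }
                    ctm.concatenate(new AffineTransform(m));
                    break;
                }
                case "Do":  // the image fills the unit square in current user space
                    boxes.add(ctm.createTransformedShape(
                            new Rectangle2D.Double(0, 0, 1, 1)).getBounds2D());
                    break;
                default:    // anything else is treated as an operand
                    operands.add(token);
                    continue;
            }
            operands.clear();
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Translation and scaling are split over two cm instructions here.
        String stream = "q 1 0 0 1 100 200 cm q 50 0 0 20 10 5 cm /Im1 Do Q Q";
        System.out.println(imageBoxes(stream));
    }
}
```

In this sample, reading only the last cm would suggest a lower left corner of (10, 5), whereas the accumulated matrix places the image at (110, 205) with a size of 50 x 20.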

Fortunately, PDFBox already does all the heavy lifting for you under the hood if you let it; cf. the PrintImageLocations example in the PDFBox examples.

The coordinates you got for "personal_photo_enhancement.pdf" page 2 are correct as far as the PDF coordinate system is concerned. Photoshop probably uses a different coordinate system, or you inspected the wrong image corner.
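
If the mismatch comes from the coordinate system, the conversion is simple. PDF user space has its origin at the lower left with y pointing up, while image editors typically measure from the top left with y pointing down. The sketch below is an illustration under assumptions, not something the question confirms: it assumes a page height of 792 points (US Letter) and a page box whose lower left corner is (0, 0); the class and method names are hypothetical.

```java
final class TopLeftCoordinates {

    /**
     * Converts a box given in PDF coordinates (origin lower left, y up) into
     * top-left based coordinates (y down), still measured in points.
     * Returns { x, top, width, height }.
     */
    static double[] toTopLeft(double x, double y, double width, double height, double pageHeight) {
        double top = pageHeight - (y + height);
        return new double[] { x, top, width, height };
    }

    /** Scales point values to pixels for a page rendered at the given DPI (1 pt = 1/72 inch). */
    static double[] toPixels(double[] boxInPoints, double dpi) {
        double scale = dpi / 72.0;
        double[] pixels = new double[boxInPoints.length];
        for (int i = 0; i < boxInPoints.length; i++) {
            pixels[i] = boxInPoints[i] * scale;
        }
        return pixels;
    }

    public static void main(String[] args) {
        // x, y, w, h as printed in the question; 792 is an assumed page height.
        double[] box = toTopLeft(71.5168, 561.056, 503.87997, 152.64, 792.0);
        System.out.printf("x=%.3f top=%.3f w=%.3f h=%.3f%n", box[0], box[1], box[2], box[3]);
    }
}
```

With the assumed page height, the reported y of 561.056 corresponds to a distance of about 78.3 points from the top edge of the page, which is the kind of value an image editor would show.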

You got so many outputs for "example.pdf" page 17 because that PDF uses CTM manipulations not only for sizing and positioning images but for other effects, too, mostly for translating the coordinate system origin. Furthermore, the image on that page is not a bitmap. Thus, it does not have a simple position and size...
