How to change the coordinates of text in a PDF page from lower left to upper left


Question

I am using PDFBox and the iTextSharp DLL to process a PDF so that I can get the coordinates of the text within a rectangle. The rectangle coordinates are extracted using iTextSharp, which uses a coordinate system with its origin at the lower left. The page text comes from PDFBox, which uses a coordinate system with its origin at the upper left. I need help converting the coordinates from lower left to upper left.

Thanks in advance.

Update to my question

Pardon me if you didn't understand my question or if I didn't provide full information.

Let me try to give more details from the start.

I am working on a tool that receives a PDF in which a rectangle has been drawn with a drawing markup in the comment section. I read the rectangle coordinates using iTextSharp:

PdfDictionary pageDict = pdReader.GetPageN(page_no);
PdfArray annotArray = pageDict.GetAsArray(PdfName.ANNOTS);

where pdReader is a PdfReader.

The page text, along with its coordinates, is extracted using PDFBox. I have a class, pdfBoxTextExtraction, that processes the text and coordinates so that it returns the text and its llx, lly, urx, ury line by line (please note: line by line, not sentence by sentence).

I want to extract the text that lies within the rectangle. I got stuck because the rectangle coordinates returned by iTextSharp (llx, lly, urx, ury) have their origin at the lower left, whereas the text coordinates returned by PDFBox have their origin at the upper left. I then realised that I need to adjust the y-axis so that the origin moves from lower left to upper left. For that, I got the height of the page and the height of the crop box:

iTextSharp.text.Rectangle mediabox = reader.GetPageSize(page_no);
iTextSharp.text.Rectangle cropbox = reader.GetCropBox(page_no);

and made some basic adjustments:

lly = mediabox.Top - lly
ury = mediabox.Top - ury

In some cases the adjustment worked, whereas for some PDFs I needed to make the adjustment against the crop box:


lly = cropbox.Top - lly
ury = cropbox.Top - ury

which worked on some PDFs but not on others.

All I need is help adjusting the rectangle coordinates so that I get the text within the rectangle.

I hope this is clear enough. If not, please pardon me and ask for clarification.

Thanks

Recommended answer

The coordinate system in PDF is defined in ISO-32000-1. This ISO standard explains that the X-axis is oriented towards the right, whereas the Y-axis has an upward orientation. This is the default. These are the coordinates that are returned by iText (behind the scenes, iText resolves all CTM transformations).

If you want to transform the coordinates returned by iText so that you get coordinates in a coordinate system where the Y-axis has a downward orientation, you could, for instance, subtract the Y value returned by iText from the Y coordinate of the top of the page.

An example: suppose that we are dealing with an A4 page, where the Y coordinate of the bottom is 0 and the Y coordinate of the top is 842. If you have Y coordinates such as y1 = 806 and y2 = 36, then you can do this:

y = 842 - y;

Now y1 = 36 and y2 = 806. You have just reversed the orientation of the Y-axis using nothing more than simple high-school math.
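As a minimal sketch of that subtraction applied to a whole rectangle, here is some plain Java arithmetic with no PDF library involved. The class and method names are illustrative and are not part of iText or PDFBox. Note that flipping the axis also swaps which edge is on top: the flipped top comes from ury and the flipped bottom from lly.

```java
/** Illustrative helper (hypothetical, not an iText or PDFBox API). */
final class YAxisFlip {

    /** Converts a Y value measured upward from the bottom into one measured downward from the top. */
    static double flipY(double y, double pageTop) {
        return pageTop - y;
    }

    /**
     * Converts a rectangle {llx, lly, urx, ury} (Y up) into {left, top, right, bottom} (Y down).
     * After the flip, the upper edge (ury) yields the smaller "top" value.
     */
    static double[] flipRect(double llx, double lly, double urx, double ury, double pageTop) {
        double top = flipY(ury, pageTop);
        double bottom = flipY(lly, pageTop);
        return new double[] { llx, top, urx, bottom };
    }

    public static void main(String[] args) {
        double pageTop = 842; // A4 height in points, as in the example above
        System.out.println(flipY(806, pageTop)); // prints 36.0
        System.out.println(flipY(36, pageTop));  // prints 806.0
    }
}
```

Assigning the flipped values back to variables that are still named lly and ury leaves them in the wrong order for a top-left system, which is one possible reason a later containment test fails.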

Update based on extra comments

Each page has a media box. This defines the most important page boundaries. Other page boundaries may be present, but none of them shall exceed the media box (if they do, then your PDF is in violation of ISO-32000-1).

The crop box defines the visible area of the page. By default (for instance if a crop box entry is missing), the crop box coincides with the media box.
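A small sketch of that default, in plain Java with a hypothetical PageBox type (not an iText or PDFBox class): a missing crop box falls back to the media box, and a crop box that extends past the media box is reduced to their intersection, which is my reading of how ISO 32000-1 treats out-of-range boundaries.

```java
/** Hypothetical box type for illustration; coordinates are in default PDF user space (Y up). */
final class PageBox {
    final double llx, lly, urx, ury;

    PageBox(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /**
     * Returns the box to treat as the visible area: the media box when no crop box
     * is given, otherwise the crop box clipped to the media box.
     */
    static PageBox effectiveCropBox(PageBox mediaBox, PageBox cropBox) {
        if (cropBox == null) {
            return mediaBox; // default: crop box coincides with media box
        }
        return new PageBox(
            Math.max(cropBox.llx, mediaBox.llx),
            Math.max(cropBox.lly, mediaBox.lly),
            Math.min(cropBox.urx, mediaBox.urx),
            Math.min(cropBox.ury, mediaBox.ury));
    }
}
```

Whichever box you pick as the reference, use the same one for both the rectangle and the text coordinates; mixing mediabox.Top for one and cropbox.Top for the other shifts the results by the difference between the two.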

In your comment, you say that you subtract llx from the height. This is incorrect. llx is the lower-left x coordinate, whereas the height is a property measured on the Y-axis, unless the page is rotated. Did you check if the page dictionary has a /Rotate value?
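To illustrate why /Rotate matters, here is a hedged sketch in plain Java that maps a point from unrotated PDF user space (origin at lower left, Y up) to coordinates measured from the top-left corner of the page as displayed. It assumes the page box starts at (0, 0) and that /Rotate is a multiple of 90 degrees clockwise. Whether a given library reports text positions this way is an assumption to verify. All names are illustrative.

```java
/** Illustrative mapping for rotated pages (hypothetical helper, not a library API). */
final class RotatedPageMapper {

    /**
     * Maps (x, y) in unrotated user space to {x', y'} measured from the top-left
     * corner of the displayed page, with Y pointing down.
     *
     * width and height are the unrotated page dimensions;
     * rotate is the /Rotate value in clockwise degrees.
     */
    static double[] toDisplayTopLeft(double x, double y, double width, double height, int rotate) {
        int r = ((rotate % 360) + 360) % 360; // normalise negative values such as -90
        switch (r) {
            case 0:
                return new double[] { x, height - y };
            case 90:
                // the unrotated bottom-left corner ends up at the displayed top-left
                return new double[] { y, x };
            case 180:
                return new double[] { width - x, y };
            case 270:
                // the unrotated top-right corner ends up at the displayed top-left
                return new double[] { height - y, width - x };
            default:
                throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotate);
        }
    }
}
```

With a rotation of 90 or 270, the X value of the unrotated point contributes to the displayed Y value, which is the only situation in which combining llx with a page dimension could make sense.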

You also claim that the values returned by iText do not match the values returned by PdfBox. Note that the values returned by iText conform with the coordinate system as defined by the ISO standard. If PdfBox doesn't follow this standard, you should ask the people from PdfBox why they didn't follow the standard, and what coordinate system they are using instead.

Maybe that's what mkl's comment is about. He wrote:


Y' = Ymax - Y. X' = X - Xmin.

Maybe PdfBox searches for the maximum Y value Ymax and the minimum X value Xmin and then applies the above transformation to all coordinates. This is a useful transformation if you want to render a PDF, but it's unwise to perform such an operation if you want to use the coordinates, for instance to add content at specific positions relative to text on the page (because the transformed coordinates are no longer "PDF" coordinates).
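As a sketch of the formula mkl describes, the following plain-Java snippet shifts a rectangle into a top-left-origin system using a reference box's Xmin and Ymax, then checks whether a text line's box falls inside it. Which box supplies Xmin and Ymax (media box, crop box, or something else) is an assumption you would need to confirm for your PdfBox version, and all names here are hypothetical.

```java
/** Illustrative conversion and containment test (hypothetical helper, not a library API). */
final class TopLeftConverter {

    /**
     * Converts {llx, lly, urx, ury} (Y up) into {left, top, right, bottom} (Y down),
     * relative to the corner (xMin, yMax) of the chosen reference box.
     */
    static double[] toTopLeft(double[] rect, double xMin, double yMax) {
        return new double[] {
            rect[0] - xMin, // X' = X - Xmin
            yMax - rect[3], // top    = Ymax - ury
            rect[2] - xMin,
            yMax - rect[1]  // bottom = Ymax - lly
        };
    }

    /** True if inner lies fully within outer; both are {left, top, right, bottom} with Y down. */
    static boolean contains(double[] outer, double[] inner) {
        return inner[0] >= outer[0] && inner[1] >= outer[1]
            && inner[2] <= outer[2] && inner[3] <= outer[3];
    }

    public static void main(String[] args) {
        double[] annotation = { 100, 600, 300, 700 };     // made-up rectangle in PDF coordinates
        double[] region = toTopLeft(annotation, 0, 842);  // becomes {100, 142, 300, 242}
        double[] textLine = { 120, 150, 280, 162 };       // made-up text box, already top-left based
        System.out.println(contains(region, textLine));   // prints true
    }
}
```

In practice a small tolerance on each edge may be needed, since text boxes and drawn rectangles rarely align exactly.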

Remarks:

You say you need PdfBox to get the text of a page. Why do you need this extra tool? iText is perfectly capable of extracting and reordering the text on a page (assuming that you use the correct extraction strategy). If not, please clarify.

  • Note that we recently decided to support Type3 fonts, although we weren't convinced that this makes sense (see "Text extraction is empty and unknown for text has type3 font using PDFBox,iText" (difficult topic!) to understand why not).
  • What some consider "wrong extraction" can often be "wrong interpretation" of what is extracted as explained in this mailing-list answer: http://thread.gmane.org/gmane.comp.java.lib.itext.general/66829/focus=66830
  • There are other cases where we follow the spec, leading to results that are different than what PdfBox returns. Watch https://www.youtube.com/watch?v=wxGEEv7ibHE for more info.
