How to change the coordinates of text in a PDF page from lower left to upper left


Question

I am processing a PDF with PDFBox and the iTextSharp DLL so that I can get the text that lies within a rectangle. The rectangle coordinates are extracted with iTextSharp, which uses a coordinate system with its origin at the lower left. The page text comes from PDFBox, which uses a coordinate system with its origin at the upper left. I need help converting the coordinates from lower left to upper left.

Update to my question

Pardon me if my question was unclear or did not provide full information.

Let me try to give more details from the start.

I am working on a tool that receives a PDF in which a rectangle has been drawn using drawing markups in a comment section. I read the rectangle coordinates using iTextSharp:

PdfDictionary pageDict = pdReader.GetPageN(page_no);
PdfArray annotArray = pageDict.GetAsArray(PdfName.ANNOTS);

where pdReader is a PdfReader.

The page text, along with its coordinates, is extracted using PDFBox. I created a class, pdfBoxTextExtraction, that processes the text and coordinates so that it returns the text and llx, lly, urx, ury line by line (note: line by line, not sentence by sentence).

So I want to extract the text that lies within the rectangle coordinates. I got stuck because the rectangle coordinates returned by iTextSharp (llx, lly, urx, ury) have their origin at the lower left, whereas the text coordinates returned by PDFBox have their origin at the upper left. Then I realised I need to adjust the y-axis so that the origin moves from lower left to upper left. For that, I got the height of the page and the height of the crop box:

iTextSharp.text.Rectangle mediabox = reader.GetPageSize(page_no);
iTextSharp.text.Rectangle cropbox = reader.GetCropBox(page_no);

I made some basic adjustments:

lly = mediabox.Top - lly
ury = mediabox.Top - ury

In some cases the adjustment worked, whereas some PDFs needed the adjustment to be made against the crop box:

lly = cropbox.Top - lly
ury = cropbox.Top - ury

On some PDFs this did not work either.

All I need is help adjusting the rectangle coordinates so that I get the text within the rectangle.

Recommended answer

The coordinate system in PDF is defined in ISO-32000-1. This ISO standard explains that the X-axis is oriented towards the right, whereas the Y-axis has an upward orientation. This is the default. These are the coordinates that are returned by iText (behind the scenes, iText resolves all CTM transformations).

If you want to transform the coordinates returned by iText so that you get coordinates in a coordinate system where the Y-axis has a downward orientation, you could for instance subtract the Y value returned by iText from the Y coordinate of the top of the page.

An example: Suppose that we are dealing with an A4 page, where the Y coordinate of the bottom is 0 and the Y coordinate of the top is 842. If you have Y coordinates such as y1 = 806 and y2 = 36, then you can do this:

y = 842 - y;

Now y1 = 36 and y2 = 806. You have just reversed the orientation of the Y-axis using nothing more than simple high-school math.
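Applied to a whole rectangle, the same subtraction has one consequence that is easy to miss: the upper edge ends up with the smaller Y value, so the two Y values swap roles. Below is a minimal sketch in plain Java with no PDF library calls. The class and method names (YFlip, flipY, flipRect) are hypothetical, and the rectangle values are made up for illustration.

```java
// Minimal sketch: flip a rectangle from a bottom-left origin (Y up)
// to a top-left origin (Y down). All names here are hypothetical.
final class YFlip {
    /** Mirror a single Y value against the Y coordinate of the page top. */
    static double flipY(double pageTop, double y) {
        return pageTop - y;
    }

    /**
     * Flip a rectangle given as {llx, lly, urx, ury} (Y up).
     * Returns {left, top, right, bottom} (Y down), so top <= bottom.
     */
    static double[] flipRect(double pageTop, double[] r) {
        double top = flipY(pageTop, r[3]);    // old upper edge: now the smaller Y
        double bottom = flipY(pageTop, r[1]); // old lower edge: now the larger Y
        return new double[] { r[0], top, r[2], bottom };
    }

    public static void main(String[] args) {
        // A4 page whose top is at Y = 842, as in the example above.
        double[] flipped = flipRect(842, new double[] { 50, 700, 300, 806 });
        System.out.println(java.util.Arrays.toString(flipped));
    }
}
```

If you keep writing the results back into variables named lly and ury, remember that after the flip "lly" holds the larger value, which can silently break a containment test written for Y-up coordinates.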

Update based on additional comments:

Each page has a media box. This defines the most important page boundaries. Other page boundaries may be present, but none of them shall exceed the media box (if they do, then your PDF is in violation of ISO-32000-1).

The crop box defines the visible area of the page. By default (for instance if a crop box entry is missing), the crop box coincides with the media box.
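A minimal sketch of that fallback, again in plain Java with hypothetical names (PageBoxes, effectiveCropBox) and made-up box values. Boxes are plain {llx, lly, urx, ury} arrays, and a missing crop box is modelled as null, which is an assumption of this sketch rather than how any particular library reports it. It also shows that the top of a box equals its height only when the box starts at Y = 0.

```java
// Minimal sketch: pick the box to measure against and compare "top" with
// "height". Names and values are hypothetical; boxes are {llx, lly, urx, ury}.
final class PageBoxes {
    /** A missing crop box (null in this sketch) defaults to the media box. */
    static double[] effectiveCropBox(double[] mediaBox, double[] cropBox) {
        return cropBox != null ? cropBox : mediaBox;
    }

    /** Y coordinate of the upper edge of the box. */
    static double top(double[] box) {
        return box[3];
    }

    /** Height of the box; equals top(box) only when lly is 0. */
    static double height(double[] box) {
        return box[3] - box[1];
    }

    public static void main(String[] args) {
        double[] media = { 0, 0, 595, 842 };
        double[] crop = { 20, 30, 575, 812 };
        double[] box = effectiveCropBox(media, crop);
        System.out.println("top = " + top(box) + ", height = " + height(box));
        System.out.println("no crop box: top = " + top(effectiveCropBox(media, null)));
    }
}
```

Which box the other tool measures from is something to verify per tool and version. If flipping against the media box works for some files and the crop box for others, a crop box with a non-zero origin is a likely cause.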

In your comment, you say that you subtract llx from the height. This is incorrect. llx is the lower-left x coordinate, whereas the height is a property measured on the Y-axis, unless the page is rotated. Did you check if the page dictionary has a /Rotate value?
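To illustrate why /Rotate matters, here is a sketch of how a point in unrotated PDF user space maps to top-left, Y-down coordinates of the page as displayed. It assumes the page box starts at (0, 0), that /Rotate is a clockwise multiple of 90 degrees, and that the other tool reports positions in the displayed orientation. That last point is an assumption you would have to verify. The names (RotatedPage, toDisplayTopLeft) are hypothetical.

```java
// Minimal sketch: map a point from unrotated PDF user space (origin at the
// lower left, Y up) to the displayed page (origin at the upper left, Y down).
// Assumes the page box starts at (0, 0). All names are hypothetical.
final class RotatedPage {
    /**
     * @param w      unrotated page width
     * @param h      unrotated page height
     * @param rotate clockwise /Rotate value in degrees (multiple of 90)
     * @return {x, y} measured from the upper-left corner of the displayed page
     */
    static double[] toDisplayTopLeft(double x, double y, double w, double h, int rotate) {
        switch (((rotate % 360) + 360) % 360) {
            case 0:
                return new double[] { x, h - y };
            case 90:  // the lower-left corner is displayed at the upper left
                return new double[] { y, x };
            case 180: // the lower-left corner is displayed at the upper right
                return new double[] { w - x, y };
            case 270: // the lower-left corner is displayed at the lower right
                return new double[] { h - y, w - x };
            default:
                throw new IllegalArgumentException("/Rotate must be a multiple of 90");
        }
    }

    public static void main(String[] args) {
        double[] p = toDisplayTopLeft(100, 200, 595, 842, 90);
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

For a rotation of 90 or 270 degrees the X and Y values trade places, which is the one situation in which an x coordinate and the page height end up in the same formula.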

You also claim that the values returned by iText do not match the values returned by PdfBox. Note that the values returned by iText conform with the coordinate system as defined by the ISO standard. If PdfBox doesn't follow this standard, you should ask the people from PdfBox why they didn't follow the standard, and what coordinate system they are using instead.

Maybe that's what mkl's comment is about. He wrote:

Y' = Ymax - Y. X' = X - Xmin.

Maybe PdfBox searches for the maximum Y value Ymax and the minimum X value Xmin and then applies the above transformation on all coordinates. This is a useful transformation if you want to render a PDF, but it's unwise to perform such an operation if you want to use the coordinates, for instance to add content at specific positions relative to text on the page (because the transformed coordinates are no longer "PDF" coordinates).
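If that guess is right, the annotation rectangle can be moved into the same space before comparing it with the text line boxes. The sketch below applies mkl's two formulas to a rectangle and then runs a simple containment test. Xmin and Ymax are passed in as parameters because where they come from (for instance the left and top of the crop box) is an assumption. The names (TextInRect, toTopLeft, contains) and all values are hypothetical.

```java
// Minimal sketch: apply X' = X - Xmin and Y' = Ymax - Y to a rectangle, then
// test whether a text line box lies inside it. All names are hypothetical.
final class TextInRect {
    /**
     * @param r {llx, lly, urx, ury} in PDF coordinates (Y up)
     * @return  {left, top, right, bottom} with the origin at the upper left (Y down)
     */
    static double[] toTopLeft(double[] r, double xMin, double yMax) {
        return new double[] { r[0] - xMin, yMax - r[3], r[2] - xMin, yMax - r[1] };
    }

    /** Both boxes are {left, top, right, bottom} in the same Y-down space. */
    static boolean contains(double[] outer, double[] inner) {
        return inner[0] >= outer[0] && inner[1] >= outer[1]
            && inner[2] <= outer[2] && inner[3] <= outer[3];
    }

    public static void main(String[] args) {
        // Annotation rectangle in PDF coordinates; Xmin = 20 and Ymax = 812 are made up.
        double[] rect = toTopLeft(new double[] { 70, 530, 320, 742 }, 20, 812);
        double[] lineInside = { 60, 100, 250, 112 };  // already in Y-down coordinates
        double[] lineOutside = { 60, 300, 250, 312 };
        System.out.println(contains(rect, lineInside));
        System.out.println(contains(rect, lineOutside));
    }
}
```

The containment test is strict. If a line should count when it only partly overlaps the rectangle, an intersection test would be needed instead.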

Remarks:

You say you need PdfBox to get the text of a page. Why do you need this extra tool? iText is perfectly capable of extracting and reordering the text on a page (assuming that you use the correct extraction strategy). If not, please clarify.

  • Note that we recently decided to support Type3 fonts, although we weren't convinced that this makes sense (see "Text extraction is empty and unknown for text has type3 font using PDFBox,iText" (difficult topic!) to understand why not).
  • What some consider "wrong extraction" can often be "wrong interpretation" of what is extracted as explained in this mailing-list answer: http://thread.gmane.org/gmane.comp.java.lib.itext.general/66829/focus=66830
  • There are other cases where we follow the spec, leading to results that are different than what PdfBox returns. Watch https://www.youtube.com/watch?v=wxGEEv7ibHE for more info.
