Getting coordinates of a selected area from a PDF using iText


Problem description

I'm trying to remove text from a particular section of a PDF. If I know the X,Y co-ordinates of the area, I'm able to remove the text. But I'm unable to get the co-ordinates of the selected area from PDF. Kindly help me.

Solution

This question is a follow-up of your previous question: Remove text occurrences contained in a specified area with iText

In that question, you asked about removing content from a specific area. Now you are asking how to determine this specific area, but your question is incomplete: you are not telling us any of the criteria used to select the area.

It seems that you are trying to do something that is called redaction. This is explained in the StackOverflow question: How to create and apply redactions?

In the answer to that question, I explain how to create redaction annotations programmatically. However, redaction is usually done manually, using Adobe Acrobat:

In Acrobat, the functionality you need is: Tools > Protection > Mark for Redaction

If you only need the coordinates and no redaction annotation, you could introduce another annotation that allows you to mark a rectangle manually and then use iText to extract the coordinates. For instance: if the rectangle is a form field, then it's really easy to get the coordinates. If the content you want to remove is a value of the form field, it's even easier to remove that content: you just remove the field.

If there is no way to retrieve these coordinates manually, then you may be facing something that is impossible. For instance, if you don't know anything about the content of the area you want to remove, how on earth are you going to teach a program what it needs to remove?

If you do know what content you're looking for, you have to parse for that content. That question has been asked and answered before: Get the exact Stringposition in PDF

Update:

In the comments, you explain that you convert the PDF page to an image, that you render the image in a Java Swing application so that a user can select a rectangle. This rectangle is stored as a java.awt.Image.

This leads to the following potential problems, because the coordinate system in Java is different from the coordinate system in PDF.

  1. The Y-axis is different: In PDF, the size of the page is described in rectangles that we call page boundaries. The most important page boundaries are the MediaBox (mandatory) and the CropBox (optional). The MediaBox contains the coordinates of the lower-left corner and the upper-right corner of the rectangle that defines your page. In the coordinate system, the Y-axis points upwards. The Y coordinate of the lower-left corner is lower than the Y coordinate of the upper-right corner. In Java, it's the other way around: the Y coordinate at the top of an object is 0 and the Y-axis points downwards: the higher the Y value, the lower the object at this Y value.
  2. There may be an offset: In most cases, the lower-left corner of the MediaBox has the coordinate X = 0, Y = 0. This isn't always the case. It may be necessary to take into account an offset.
  3. The resolution can be different: The default user unit corresponds with a point. For instance: an A4 page measures 595 by 842 user units. There are 72 points in every inch. When you create an image, you don't necessarily measure in points. Maybe you measure in pixels. Maybe you create an image with 300 pixels per inch (300 dpi).

All these reasons can cause the rectangle you get from your Swing app to be different from the coordinates you need to use in PDF. You need to take all of this into account; otherwise, you'll keep on facing your "it doesn't work" problem. This is not an iText problem, this is a math problem.
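
The three adjustments listed above (flipped Y-axis, offset, resolution) can be combined into one small conversion. The following is only a minimal sketch under stated assumptions, not iText code: the class name `SwingToPdfCoordinates` and the method `toPdfRect` are hypothetical, the page is assumed to be unrotated, and the image is assumed to show exactly the page box you pass in (MediaBox or CropBox) at a uniform resolution. With iText 5, the box values would typically come from something like `PdfReader.getPageSize(pageNumber)` or `PdfReader.getCropBox(pageNumber)`, but check this against the iText version you use.

```java
// Hypothetical helper: converts a rectangle selected on a rendered page image
// into PDF user space. Assumes no page rotation and that the image shows
// exactly the given page box at a uniform resolution.
class SwingToPdfCoordinates {

    /**
     * @param px, py   top-left corner of the selection in image pixels (Y points down)
     * @param pw, ph   width and height of the selection in pixels
     * @param dpi      resolution at which the page image was rendered
     * @param boxLlx, boxLly, boxUrx, boxUry  page box in PDF user units (points)
     * @return {llx, lly, urx, ury} of the selection in PDF user space (Y points up)
     */
    static double[] toPdfRect(double px, double py, double pw, double ph,
                              double dpi,
                              double boxLlx, double boxLly,
                              double boxUrx, double boxUry) {
        // Resolution: 72 points per inch, dpi pixels per inch.
        double pointsPerPixel = 72.0 / dpi;

        // Offset: X is measured from the left edge of the page box.
        double llx = boxLlx + px * pointsPerPixel;
        double urx = boxLlx + (px + pw) * pointsPerPixel;

        // Y-axis: image Y grows downwards from the top edge of the box,
        // PDF Y grows upwards, so distances are subtracted from the top edge.
        double ury = boxUry - py * pointsPerPixel;
        double lly = boxUry - (py + ph) * pointsPerPixel;

        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // A4 page (595 x 842 user units) rendered at 300 dpi;
        // the user selected a 400 x 50 pixel rectangle at pixel (100, 200).
        double[] r = toPdfRect(100, 200, 400, 50, 300, 0, 0, 595, 842);
        System.out.println("llx=" + r[0] + " lly=" + r[1]
                + " urx=" + r[2] + " ury=" + r[3]);
    }
}
```

In this example the selection maps to approximately (24, 782) for the lower-left corner and (120, 794) for the upper-right corner. If the page has a rotation or the image was cropped or scaled non-uniformly, the sketch needs additional steps.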
