Accessing "alternate text" for an image via PDFBox

Question

Is there some way to extract "alternate text" for a specific image using PDFBox?

I have a PDF file which, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1, has had alternate text added to an image. Using PDFBox I can find my way through the object model to the image itself (a PDXObjectImage) via PDDocument.getDocumentCatalog().getAllPages(), iterating over the pages and calling getResources().getImages() on each, but I cannot see any way to get from the image itself to its alternate text.

A small sample PDF (with a single image which has some alternate text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (It should say "This is the alternate text for the image.").

Recommended answer

I do not know how, or whether, this can be done with PDFBox, but I can tell you that this feature relates to the sections of the PDF specification called Logical Structure and Tagged PDF, which not every PDF tool fully supports.

Assuming it is supported by the tool you are using, you will have to follow 4 main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation).

Assuming you have access to the internal structure of the PDF file, you will need to:

1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.

Page content:

BT
/P <</MCID 0 >>BDC 
/GS0 gs
/TT0 1 Tf
0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
(This is an image test )Tj
EMC 
ET
/Figure <</MCID 1 >>BDC 
q
106.5 0 0 106.5 90 591.0599976 cm
/Im0 Do
Q
EMC 

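Your image is drawn by the /Im0 Do operator, which sits inside the /Figure marked-content sequence, so its MCID is 1.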
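
As a rough illustration of step 1, the following Java sketch scans the text of a content stream for the marked-content sequence that encloses a given image. It uses only the standard library and a regular expression rather than a real content-stream parser, so it assumes that MCID is the only entry in the BDC property dictionary and that operator names never appear inside string literals. The names McidFinder and mcidForImage are made up for this example and are not part of PDFBox.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class McidFinder {
    // Group 1: MCID of "/Tag <</MCID n >>BDC"; then any other BDC/BMC;
    // then EMC; group 2: XObject name of "/Name Do".
    private static final Pattern TOKEN = Pattern.compile(
        "/\\w+\\s*<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC"
        + "|\\b(?:BDC|BMC)\\b"
        + "|\\bEMC\\b"
        + "|/(\\w+)\\s+Do\\b");

    /** Returns the MCID of the innermost tagged sequence enclosing "/imageName Do". */
    static OptionalInt mcidForImage(String content, String imageName) {
        Deque<Integer> open = new ArrayDeque<>(); // -1 marks a sequence without a parsed MCID
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            String token = m.group();
            if (m.group(1) != null) {
                open.push(Integer.parseInt(m.group(1)));
            } else if (m.group(2) != null) {
                if (m.group(2).equals(imageName)) {
                    // Iteration starts at the most recently opened sequence.
                    for (int id : open) {
                        if (id >= 0) {
                            return OptionalInt.of(id);
                        }
                    }
                    return OptionalInt.empty();
                }
            } else if (token.equals("EMC")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else {
                open.push(-1); // BDC or BMC whose MCID this sketch could not read
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String content = String.join("\n",
            "BT",
            "/P <</MCID 0 >>BDC",
            "(This is an image test )Tj",
            "EMC",
            "ET",
            "/Figure <</MCID 1 >>BDC",
            "q",
            "106.5 0 0 106.5 90 591.0599976 cm",
            "/Im0 Do",
            "Q",
            "EMC");
        // prints "MCID for /Im0: 1"
        System.out.println("MCID for /Im0: " + mcidForImage(content, "Im0").getAsInt());
    }
}
```

A production implementation would tokenize the stream properly (for instance with the PDF library's own content-stream parser) instead of relying on a regular expression.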

2- In the page object, retrieve the key StructParents.

3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.

4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF Spec for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So, in the pair that matches your page's StructParents value, look up the array element at the index equal to your image's MCID; you will find a PDF object (a structure element). Inside this object, the Alt entry holds the alternate text.
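
Steps 2 to 4 can be sketched in Java as follows. This is only a model under stated assumptions: PDF dictionaries are represented as plain Map objects and PDF arrays as List objects, indirect references are assumed to be already resolved, and the class AltTextLookup with its methods is hypothetical rather than anything PDFBox provides. The sample values in main (StructParents 0, and the alternate text quoted in the question) are assumptions about the sample file, not values read from it.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class AltTextLookup {
    /** Looks up a key in a number tree node: a Nums leaf or a Kids intermediate node. */
    static Object numberTreeGet(Map<String, Object> node, int key) {
        Object nums = node.get("Nums");
        if (nums instanceof List<?>) {
            List<?> pairs = (List<?>) nums; // flat [key1, value1, key2, value2, ...]
            for (int i = 0; i + 1 < pairs.size(); i += 2) {
                if (Integer.valueOf(key).equals(pairs.get(i))) {
                    return pairs.get(i + 1);
                }
            }
        }
        Object kids = node.get("Kids");
        if (kids instanceof List<?>) {
            for (Object kid : (List<?>) kids) {
                if (kid instanceof Map<?, ?>) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> child = (Map<String, Object>) kid;
                    Object hit = numberTreeGet(child, key);
                    if (hit != null) {
                        return hit;
                    }
                }
            }
        }
        return null;
    }

    /** Follows Catalog -> StructTreeRoot -> ParentTree -> [StructParents] -> [MCID] -> Alt. */
    static Optional<String> altText(Map<String, Object> catalog, Map<String, Object> page, int mcid) {
        Object root = catalog.get("StructTreeRoot");
        Object structParents = page.get("StructParents"); // step 2
        if (!(root instanceof Map<?, ?>) || !(structParents instanceof Integer)) {
            return Optional.empty();
        }
        Object parentTree = ((Map<?, ?>) root).get("ParentTree"); // step 3
        if (!(parentTree instanceof Map<?, ?>)) {
            return Optional.empty();
        }
        @SuppressWarnings("unchecked")
        Object entry = numberTreeGet((Map<String, Object>) parentTree, (Integer) structParents);
        if (!(entry instanceof List<?>)) {
            return Optional.empty();
        }
        List<?> elements = (List<?>) entry; // step 4: indexed by MCID
        if (mcid < 0 || mcid >= elements.size() || !(elements.get(mcid) instanceof Map<?, ?>)) {
            return Optional.empty();
        }
        Object alt = ((Map<?, ?>) elements.get(mcid)).get("Alt");
        if (alt instanceof String) {
            return Optional.of((String) alt);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the sample file's structure (assumed values).
        Map<String, Object> paragraph = Map.of("S", "P");
        Map<String, Object> figure = Map.of(
            "S", "Figure", "Alt", "This is the alternate text for the image.");
        Map<String, Object> parentTree = Map.of("Nums", List.of(0, List.of(paragraph, figure)));
        Map<String, Object> catalog = Map.of("StructTreeRoot", Map.of("ParentTree", parentTree));
        Map<String, Object> page = Map.of("StructParents", 0);

        System.out.println(altText(catalog, page, 1).orElse("(no alternate text)"));
    }
}
```

With a real PDF library, the Map and List lookups would be replaced by that library's dictionary and array accessors; the traversal order stays the same.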

Looks easy, doesn't it?

Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer
