Extract Images and Words with coordinates and sizes from PDF
Question
I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a solution for extracting images and text (with coordinates) from a PDF.
The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in the image.
I know there is no way to extract structured info from a PDF like this, but with the coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split the text using a RegExp and work out which part is a product code, which is an image code, etc.
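The distance-based linking described above could look roughly like the following sketch. It assumes the image and word bounding boxes have already been extracted by some tool; the `Box` type, the `link_words_to_images` helper, and both code patterns (`IMG-123` for image codes, five digits for product codes) are hypothetical and would need adjusting to the real catalog.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.hypot(ax - bx, ay - by)


# Assumed code formats, for illustration only.
IMAGE_CODE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE = re.compile(r"^\d{5}$")


def link_words_to_images(images, words):
    """Attach each recognized code to the image whose center is nearest.

    images: dict mapping an image name to its Box
    words:  list of (text, Box) pairs
    """
    result = {name: {"image_codes": [], "product_codes": []} for name in images}
    if not images:
        return result
    for text, box in words:
        nearest = min(images, key=lambda name: distance(images[name], box))
        if IMAGE_CODE.match(text):
            result[nearest]["image_codes"].append(text)
        elif PRODUCT_CODE.match(text):
            result[nearest]["product_codes"].append(text)
        # Words matching neither pattern are ignored.
    return result
```

Center-to-center distance is the simplest choice here; for catalogs where captions sit far from the image center, the distance between box edges may work better.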
Could you recommend a good, working solution for this task?
Recommended answer
Use XPDF ( http://www.foolabs.com/xpdf/ )
It can extract all the characters in the PDF with coordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.
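As a rough sketch of how the `pdftotext -bbox` output might be consumed from Python: to my knowledge the `-bbox` option is provided by the Poppler build of pdftotext (an Xpdf-derived project), so it is worth checking that your installed version supports it. The code assumes the output is XHTML containing `page` elements with `word` children that carry `xMin`, `yMin`, `xMax` and `yMax` attributes; verify this against real output before relying on it. The function names are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftotext_bbox(pdf_path, out_path):
    """Run `pdftotext -bbox` (must be on PATH and support -bbox)."""
    subprocess.run(["pdftotext", "-bbox", pdf_path, out_path], check=True)


def parse_bbox(xhtml):
    """Return one dict per word: page number, text and bounding box.

    Assumes <page> elements containing <word xMin=.. yMin=.. xMax=.. yMax=..>
    children, possibly inside an XML namespace.
    """
    root = ET.fromstring(xhtml)
    words = []
    page_no = 0
    # iter() walks the tree in document order, so a <page> is seen
    # before the <word> elements it contains.
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if tag == "page":
            page_no += 1
        elif tag == "word":
            words.append({
                "page": page_no,
                "text": elem.text or "",
                "x_min": float(elem.get("xMin")),
                "y_min": float(elem.get("yMin")),
                "x_max": float(elem.get("xMax")),
                "y_max": float(elem.get("yMax")),
            })
    return words
```

A typical use would be to call `run_pdftotext_bbox("catalog.pdf", "catalog.html")`, read the resulting file, and pass its contents to `parse_bbox`; the file names here are placeholders.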
It's open source (GPLv2) and supports many additional extraction features as well.