Extract Images and Words with coordinates and sizes from PDF

Question

I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a solution for extracting images and text (with coordinates) from a PDF.

The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in the image.

I know there is no way to extract structured info from a PDF like this, but with the coordinates of all image and text objects I could write code that links text to an image by its distance from that image. Then I could split the text using a RegExp and work out what is a product code, what is an image code, etc.
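The distance-based linking described above could be sketched roughly as follows. This is only an illustration under assumptions: bounding boxes are `(x_min, y_min, x_max, y_max)` tuples, the two regular expressions are hypothetical code formats (a real catalog will need its own patterns), and `max_distance` is an arbitrary cut-off that would have to be tuned to the page layout.

```python
import math
import re

# Hypothetical code formats; adjust to the real catalog.
IMAGE_CODE_RE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE_RE = re.compile(r"^[A-Z]{2}\d{4}$")


def box_center(box):
    """Return the center point of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def link_words_to_images(images, words, max_distance=150.0):
    """Attach each recognized code to the nearest image.

    images: dict mapping an image name to its bounding box.
    words:  list of (text, bounding box) pairs.
    Words farther than max_distance from every image are skipped.
    """
    linked = {name: {"image_codes": [], "product_codes": []} for name in images}
    for text, box in words:
        wx, wy = box_center(box)
        best_name, best_dist = None, max_distance
        for name, image_box in images.items():
            ix, iy = box_center(image_box)
            dist = math.hypot(wx - ix, wy - iy)
            if dist <= best_dist:
                best_name, best_dist = name, dist
        if best_name is None:
            continue  # no image close enough
        if IMAGE_CODE_RE.match(text):
            linked[best_name]["image_codes"].append(text)
        elif PRODUCT_CODE_RE.match(text):
            linked[best_name]["product_codes"].append(text)
    return linked
```

Center-to-center distance is the simplest choice here; for wide images or long captions, edge-to-edge distance may link text more reliably.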

Could you recommend a good, working solution for this task?

Recommended answer

Use XPDF (http://www.foolabs.com/xpdf/).

It can extract all the text in the PDF with coordinates (pdftotext -bbox [sourcefile] [outputfile]), as well as all the images and SVGs in the PDF.
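As a rough sketch of how the `-bbox` output might be consumed: the code below assumes a pdftotext build that supports `-bbox` (Poppler's pdftotext does) and that the output is XHTML in which each `<page>` element contains `<word>` elements with `xMin`, `yMin`, `xMax` and `yMax` attributes. The function name `parse_bbox_words` and the returned dictionary layout are illustrative choices, not part of any tool.

```python
import xml.etree.ElementTree as ET


def parse_bbox_words(xhtml_text):
    """Parse `pdftotext -bbox` output into a list of word records.

    Each record holds the 1-based page number, the word text and its
    bounding box as floats (x_min, y_min, x_max, y_max).
    """
    root = ET.fromstring(xhtml_text)
    words = []
    page_no = 0
    # iter() walks in document order, so a <page> is seen before its words.
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace prefix
        if tag == "page":
            page_no += 1
        elif tag == "word":
            words.append({
                "page": page_no,
                "text": (elem.text or "").strip(),
                "box": (
                    float(elem.get("xMin")),
                    float(elem.get("yMin")),
                    float(elem.get("xMax")),
                    float(elem.get("yMax")),
                ),
            })
    return words
```

After running `pdftotext -bbox catalog.pdf words.html` (file names are placeholders), the contents of `words.html` can be read as text and passed to `parse_bbox_words`.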

It's open source (GPLv2) and supports many other extraction features as well.
