How to convert Scanned PDF to XML


Problem description

I have a large batch of scanned PDF documents.

I want to read the scanned PDF documents and generate XML from them.

Then I want to update the content of the PDF from the modified XML file.

How can I do this?

Solution

Hire people to do this for you. :)
This is a really challenging task, not one for a "quick answers" kind of forum. There are commercial applications for such tasks, but in general it can't be performed with 100% accuracy.

What you need:
- an OCR engine (let's suppose that the quality of the images is good enough and that there is no handwriting) - some scanners already add an OCR-ed text layer above the scanned image
- one or more patterns that map text to XML elements based on position or some metadata (supposing your documents come in a limited number of types)
- document type recognition logic
- content validation logic, so you have a clue how well the automatic process performed
- editing the PDF is a separate problem. If the scanned image was not OCR-ed by the scanner, you cannot edit the image itself; you have to put the new text on top of the original
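
The position-based mapping from the list above could look roughly like this. It is only a minimal sketch, assuming the OCR engine has already returned words with coordinates; the `Word` type, the `INVOICE_ZONES` template, and all coordinates are hypothetical and not taken from any particular OCR product.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Word:
    """One OCR-ed word with the top-left corner of its box (hypothetical shape)."""
    text: str
    x: float
    y: float


# Hypothetical template for one document type: element name -> (x0, y0, x1, y1).
INVOICE_ZONES = {
    "invoice_number": (400, 0, 600, 50),
    "customer": (0, 60, 300, 120),
}


def words_to_xml(words, zones, root_tag="document"):
    """Put each word into the XML element whose zone contains it."""
    root = ET.Element(root_tag)
    for name, (x0, y0, x1, y1) in zones.items():
        inside = [w for w in words if x0 <= w.x < x1 and y0 <= w.y < y1]
        # Reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (w.y, w.x))
        ET.SubElement(root, name).text = " ".join(w.text for w in inside)
    return root


words = [
    Word("Acme", 10, 70),
    Word("Ltd", 60, 70),
    Word("INV-001", 420, 10),
    Word("stray", 700, 700),  # outside every zone, so it is ignored
]
xml_text = ET.tostring(words_to_xml(words, INVOICE_ZONES), encoding="unicode")
print(xml_text)
```

A real system would need one such template per document type, which is why the document type recognition step comes first.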

But these are only the basic concepts. Such a task is really hard: many months of full-time work, and at the end you will still have special cases where the automatic handling does not work. For those you have to add some user interaction, so you will need a user interface too.
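
One way to decide which documents need that user interaction is to validate the extracted fields and send failures to a person. This is a rough sketch under assumed rules: the field names and the patterns in `FIELD_RULES` are made up for illustration, and they check format only.

```python
import re

# Hypothetical per-field rules; real rules depend on the document type.
FIELD_RULES = {
    "invoice_number": re.compile(r"INV-\d{3,}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def validate(fields, rules=FIELD_RULES):
    """Return the names of fields that are missing or do not match their rule."""
    problems = []
    for name, pattern in rules.items():
        value = fields.get(name, "")
        if not pattern.fullmatch(value):
            problems.append(name)
    return problems


def route(fields):
    """Accept clean documents automatically; send the rest to a human."""
    problems = validate(fields)
    return ("manual_review", problems) if problems else ("auto", [])


clean = route({"invoice_number": "INV-001", "date": "2013-02-28"})
# Typical OCR confusion: lowercase "l" for "I" and letter "O" for zero.
dirty = route({"invoice_number": "lNV-OO1", "date": "2013-02-28"})
print(clean, dirty)
```

The share of documents that end up in `manual_review` also gives a crude measure of how well the automatic process is performing.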


Adobe Acrobat 9 Pro (v9.5.2) does a good job of making .xml out of .pdf. Its Save dialog has an option to save as "XML 1.0", with settings for:

Encoding, bookmark generation, tag generation ...

And there are Image File Settings:

Generate images, use sub-folder, output format (TIFF, JPG, PNG), even downsampling ...

So, as complicated as "disassembling" a .pdf can be (I know from personal experience), Adobe is the original "fonter" and "printer", and this app more than enables them to package their proprietary knowledge both formidably and somewhat successfully.


; the only downside.

