PDF to XML and back to PDF again

Question

I recently asked a question about converting a PDF file to an XML file and then back to a PDF file, preferably exactly the same as the original, but at least almost the same.

I've been trying different methods, and so far I have come up with this one.

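  1. A document written in LibreOffice is saved as DocBook XML. Say it is named "file.xml".
  2. This file is processed by a set of XSL templates from the DocBook project, starting from the file docbook.xsl.
  3. This is done by running: xsltproc -o middle-fo-file.fo /usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl file.xml
  4. The result is an intermediate XSL-FO file, which is converted to PDF by running: fop middle-fo-file.fo final.pdf
  5. This PDF file looks almost identical to the original ODT file.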
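
For reference, steps 3 and 4 can be chained in a small script. This is only a sketch: the function names `build_commands` and `run_pipeline` are made up for illustration, and the stylesheet path is the one quoted in the question, which may differ on other systems. It assumes `xsltproc` and `fop` are installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

# Stylesheet path as quoted in the question; it varies between distributions.
DOCBOOK_FO_XSL = "/usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl"


def build_commands(docbook_xml, fo_file, pdf_file, stylesheet=DOCBOOK_FO_XSL):
    """Return the argv lists for the DocBook -> XSL-FO -> PDF steps."""
    # Step 3: apply the DocBook FO stylesheet to get an intermediate XSL-FO file.
    xslt_cmd = ["xsltproc", "-o", str(fo_file), str(stylesheet), str(docbook_xml)]
    # Step 4: render the XSL-FO file to PDF with Apache FOP.
    fop_cmd = ["fop", str(fo_file), str(pdf_file)]
    return [xslt_cmd, fop_cmd]


def run_pipeline(docbook_xml, fo_file="middle-fo-file.fo", pdf_file="final.pdf"):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(docbook_xml, fo_file, pdf_file):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} is not on PATH")
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(pdf_file)
```

Calling `run_pipeline("file.xml")` would then produce `final.pdf` in the current directory, provided both tools are available.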

But suppose I start with a PDF file instead: how could the same thing be done? Any suggestions?

Recommended answer

The only chance of a lossless conversion from PDF to XML is to use a target XML vocabulary that has the same view of documents that PDF has. Since PDF's view of documents is focused primarily, if not exclusively, on presentation, and the usual motivation for the design of XML vocabularies like DocBook is to capture higher-level abstractions, you face two difficulties: (1) presentation-oriented XML vocabularies are not thick on the ground, and (2) if you want to go from PDF to a more conventional XML vocabulary (either directly or via a presentation-oriented XML) you will be pushing water uphill, trying to interpret the presentation of the document in terms of the higher-level abstractions of your target vocabulary. It will be very difficult, at best, to automate such a process.
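
To make the distinction concrete, here is a minimal sketch of what a presentation-oriented vocabulary might look like. The `<page>` and `<run>` elements, their attributes, and the `TextRun` class are all hypothetical, invented for this illustration; a real PDF page carries far more (graphics, embedded fonts, and so on). The sketch only shows that positioned text survives the round trip, while nothing in the XML says which run is a heading.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """One positioned piece of text, roughly the kind of thing a PDF page records."""
    x: float
    y: float
    font: str
    size: float
    text: str


def runs_to_xml(runs):
    """Serialize runs into a made-up presentation-oriented vocabulary."""
    page = ET.Element("page")
    for run in runs:
        el = ET.SubElement(page, "run", x=repr(run.x), y=repr(run.y),
                           font=run.font, size=repr(run.size))
        el.text = run.text
    return ET.tostring(page, encoding="unicode")


def xml_to_runs(xml_text):
    """Parse the same vocabulary back into TextRun objects."""
    page = ET.fromstring(xml_text)
    return [TextRun(float(el.get("x")), float(el.get("y")), el.get("font"),
                    float(el.get("size")), el.text or "")
            for el in page.findall("run")]


runs = [
    TextRun(72.0, 720.0, "Helvetica-Bold", 18.0, "Chapter 1"),
    TextRun(72.0, 690.0, "Times-Roman", 11.0, "It was a dark and stormy night."),
]
xml_text = runs_to_xml(runs)
# Presentation data round-trips; the XML never says "Chapter 1" is a title.
assert xml_to_runs(xml_text) == runs
```

Mapping such runs onto DocBook elements like `<title>` or `<para>` would require guessing structure from font and position, which is the uphill step described above.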

If this is a kind of thought experiment and you are thinking about the PDF-XML-PDF round trip to see when and how it's possible, then you now know the reasons some people will give for believing it's not possible in any general form. If you want this PDF-to-PDF data flow for some practical reason, you might want to reflect on whether your practical goals can be met in some other way.
