Transform PDF to HTML, keep layout

Question

What methods are there to transform a PDF to HTML? It could be anything: an online service, software, or a library. (Open source preferred. For a library, PHP or Python would be preferred.) It has to keep the original layout (including page numbers, footnotes and such), keep the images (combining them into a single background image per page is acceptable) and keep the links. It should preferably output valid XHTML and clean up PDF features such as ligatures, but if some post-processing is required, I can live with that. Something with clean, relatively semantic HTML output would be great.

The closest one I found was zamzar.org, but it choked on links. (Also, the HTML output is an ugly heap of absolutely positioned divs and needs post-processing because of encoding problems.)

Recommended answer

I have worked with the iText library and found it good at parsing the PDF structure (I used it to search for text). It is a library that parses a PDF and creates an object model out of it, so you will need to write the HTML generator yourself, but that should not be too difficult.
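To give an idea of what the generator step might look like, here is a minimal sketch in Python (the language the question prefers), using only the standard library. It does not call iText or any other PDF parser. The `TextFragment` class, its fields, and `page_to_html` are hypothetical names made up for illustration; they stand in for whatever text runs, positions and link targets your parser reports. The vertical placement is a rough approximation based on font size, not real font metrics, so treat this as a starting point.

```python
from dataclasses import dataclass
from html import escape
from typing import Iterable, Optional

# Common Latin ligature code points that PDFs often embed as single glyphs.
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


@dataclass
class TextFragment:
    """Hypothetical text run as a PDF parser might report it (units: points)."""
    text: str
    x: float                    # left edge, measured from the left of the page
    y: float                    # baseline, measured from the BOTTOM of the page
    font_size: float
    href: Optional[str] = None  # link target, if a link annotation covers the run


def page_to_html(fragments: Iterable[TextFragment],
                 page_width: float, page_height: float) -> str:
    """Render one page as a relatively positioned div of absolute spans."""
    parts = [
        '<div class="page" style="position:relative;'
        f'width:{page_width:g}pt;height:{page_height:g}pt">'
    ]
    for frag in fragments:
        # PDF y grows upward from the bottom edge; CSS top grows downward.
        # Subtracting the font size is a crude estimate of the glyph height.
        top = page_height - frag.y - frag.font_size
        # Expand ligatures first, then escape &, <, > and quotes.
        content = escape(frag.text.translate(LIGATURES))
        if frag.href:
            content = f'<a href="{escape(frag.href)}">{content}</a>'
        parts.append(
            f'<span style="position:absolute;left:{frag.x:g}pt;'
            f'top:{top:g}pt;font-size:{frag.font_size:g}pt">{content}</span>'
        )
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up fragments on a US Letter page (612 x 792 points).
    demo = [
        TextFragment("Chapter 1", 72, 720, 18),
        TextFragment("See the project site", 72, 690, 12,
                     href="http://example.com/"),
    ]
    print(page_to_html(demo, 612, 792))
```

This produces the same kind of absolutely positioned output the question complains about, which is the price of keeping the layout exactly. Getting more semantic HTML would mean grouping fragments into lines and paragraphs before emitting them, and that grouping is the part that needs real work.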
