Convert doc/docx to semantic HTML
Problem description
I would like to convert doc/docx documents to semantic HTML.
Some wishes/requirements:
• Semantic HTML, so that headings in the document become <h1>, <h2>, etc., tables become <table>, and so forth.
• It should preferably handle headings, lists, tables, and images. Graphs and math formulas are a nice extra.
• It doesn't have to convert straight from doc/docx to HTML; an intermediary format such as XML or DocBook is fine.
• It should work programmatically and with a large number of documents.
The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs and a small user base, and it can't handle a lot of documents. It is more of a proof of concept.
There's a tool called upCast which is able to convert Word documents into XML.
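To make the "intermediary XML" idea concrete, here is a minimal sketch of what such a conversion might look like when done by hand. It is not how docvert or upCast work. It relies on the fact that a .docx file is a ZIP archive whose main content lives in word/document.xml, and it assumes the document uses Word's default English heading style ids ("Heading1", "Heading2", ...). The function names docx_to_html and paragraph_tag are made up for illustration. It covers only headings and plain paragraphs; lists, tables, images, and the older binary .doc format are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_tag(p):
    """Pick an HTML tag for a w:p element based on its paragraph style."""
    style = p.find(f"{W}pPr/{W}pStyle")
    if style is not None:
        val = style.get(f"{W}val", "")
        # Assumption: default English style ids such as "Heading1".."Heading6".
        if val.startswith("Heading") and val[7:].isdigit():
            level = min(max(int(val[7:]), 1), 6)
            return f"h{level}"
    return "p"


def docx_to_html(source):
    """Convert the paragraphs of a .docx (path or file-like object) to HTML.

    Paragraphs inside tables are emitted as flat <p> elements; real table
    handling would need to walk w:tbl / w:tr / w:tc separately.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{W}body")
    parts = []
    for p in body.iter(f"{W}p"):
        # A paragraph's text is spread across w:t elements inside its runs.
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))
        if text.strip():
            tag = paragraph_tag(p)
            parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(parts)
```

Because it is a plain function, it could be called in a loop over a directory of files to cover the "large number of documents" requirement, though that has not been tested at scale here.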