Convert MS Word to XML/HTML on Linux

Question

I need to convert an MS Word file into XML or HTML while preserving the structure of the file (mainly tables). I happened to find Tika, which is quite powerful at extracting text from MS Word files (and other file types), as follows:

curl www.vit.org/downloads/doc/tariff.doc \
  | java -jar tika-app-1.3.jar --text

I can also choose an option that writes the output as HTML/XML, as follows:

curl www.vit.org/downloads/doc/tariff.doc \
  | java -jar tika-app-1.3.jar --html

But the output is essentially plain text wrapped in HTML, so the table structure and other document elements are lost.
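One way to check how much structure a converter actually preserves is to count the structural tags in its HTML output. The following is a minimal sketch using only the Python standard library; the set of tags treated as "structural" and the names `StructureCounter` and `count_structure` are illustrative assumptions, not part of Tika.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags taken here as evidence of preserved structure (an illustrative choice).
STRUCTURAL_TAGS = {"table", "tr", "td", "th", "ul", "ol", "li", "h1", "h2", "h3"}


class StructureCounter(HTMLParser):
    """Counts opening tags that carry document structure."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in STRUCTURAL_TAGS:
            self.counts[tag] += 1


def count_structure(html_text):
    """Return a dict mapping each structural tag to its number of occurrences."""
    parser = StructureCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)
```

Feeding the saved converter output to `count_structure` and getting an empty dict back would suggest that tables and lists were flattened into plain paragraphs.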

Is there any implementation of Tika, in Perl or Python, that can convert the document into XML/HTML while maintaining the structure of its elements? Or is there any other tool on Linux that can do this?

Recommended answer

Install the OpenOffice SDK; it offers a powerful API for all kinds of documents (including conversions).

http://www.oooforum.org/forum/viewtopic.phtml?t=7242
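The answer does not show the SDK calls themselves. As a simpler alternative, and not the SDK API, here is a hedged Python sketch that drives the office suite's command-line converter instead. It assumes LibreOffice (a fork of OpenOffice) is installed with its `soffice` binary on the PATH, and relies on its `--headless`, `--convert-to` and `--outdir` options. The helper names `build_convert_command` and `convert` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(input_path, out_dir, target_format="html", binary="soffice"):
    """Build the argument list for a headless conversion (no shell involved)."""
    return [
        binary,
        "--headless",                   # run without a GUI
        "--convert-to", target_format,  # e.g. "html"
        "--outdir", str(out_dir),
        str(input_path),
    ]


def convert(input_path, out_dir, target_format="html"):
    """Run the conversion and return the path where the result is expected."""
    binary = shutil.which("soffice")
    if binary is None:
        # Assumption: the converter is reachable as "soffice" on the PATH.
        raise FileNotFoundError("soffice not found on PATH")
    subprocess.run(
        build_convert_command(input_path, out_dir, target_format, binary),
        check=True,
    )
    # Assumes a plain format name such as "html", so the output file is
    # the input's base name plus that extension.
    return Path(out_dir) / (Path(input_path).stem + "." + target_format)
```

For example, `convert("tariff.doc", "out")` would be expected to write its result to `out/tariff.html`. Whether tables come through intact depends on the document and the installed version, so the output is worth inspecting.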
