Apache POI - converting *.doc to *.html with images


Question

I have a DOC file that contains some images. How can I convert it to HTML with the images included?

I tried to use this example: Convert Word doc to HTML programmatically in Java

public class Converter {
    ...

    private File docFile, htmlFile;

    try {
        FileInputStream fos = new FileInputStream(docFile.getAbsolutePath()); 
        HWPFDocument doc = new HWPFDocument(fos);       
        Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc) ;
        wordToHtmlConverter.processDocument(doc);

        StringWriter stringWriter = new StringWriter();

        Transformer transformer = TransformerFactory.newInstance().newTransformer();        
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
                    new DOMSource(wordToHtmlConverter.getDocument()),
                    new StreamResult(stringWriter)
        );

        String html = stringWriter.toString();

        try {
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8")
            );
            out.write(html);
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        JEditorPane jEditorPane = new JEditorPane();
        jEditorPane.setContentType("text/html");
        jEditorPane.setEditable(false);
        jEditorPane.setPage(htmlFile.toURI().toURL());

        JScrollPane jScrollPane = new JScrollPane(jEditorPane);

        JFrame jFrame = new JFrame("display html file");
        jFrame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        jFrame.getContentPane().add(jScrollPane);
        jFrame.setSize(512, 342);
        jFrame.setVisible(true);

    } catch(Exception e) {
        e.printStackTrace();
    }
    ...
}

But the images are missing from the output.

The documentation for the WordToHtmlConverter class says the following:


...this implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.
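As a rough illustration of what such an override would have to do on the HTML side, the sketch below uses only the JDK: it writes the picture bytes to a directory and appends an img element to the block currently being built. It is not POI code. The class ImageElementSketch, the methods appendImage and newDocument, and the "images" directory name are all hypothetical, and the byte array stands in for whatever the Picture object would supply.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Apache POI.
final class ImageElementSketch {

    /** Creates an empty DOM document, like the one handed to the converter. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Writes the image bytes to imageDir/fileName and appends
     * <img src="dirName/fileName"> to currentBlock.
     * Assumption: the HTML file is saved next to the image directory,
     * so a relative link of the form "images/image1.png" resolves.
     */
    static Element appendImage(Element currentBlock, byte[] content,
                               String fileName, Path imageDir) {
        try {
            Files.createDirectories(imageDir);
            Files.write(imageDir.resolve(fileName), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", imageDir.getFileName() + "/" + fileName);
        currentBlock.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws IOException {
        Document doc = newDocument();
        Element paragraph = doc.createElement("p");
        doc.appendChild(paragraph);

        Path imageDir = Files.createTempDirectory("doc2html").resolve("images");
        // Placeholder bytes; a real override would pass the picture's content.
        Element img = appendImage(paragraph, new byte[] {1, 2, 3}, "image1.png", imageDir);
        System.out.println(img.getAttribute("src")); // prints images/image1.png
    }
}
```

In an actual processImage override, the Element argument would play the role of currentBlock, and the bytes and file name would come from the Picture argument.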

How can I convert a DOC file to HTML with images?

Recommended answer

Your best bet in this case is to use Apache Tika and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but in your case you want the HTML). It also puts in placeholders for embedded resources and img tags for embedded images, and gives you a way to get at the contents of those embedded resources and images.

Alfresco includes a very good example of doing this, HTMLRenderingEngine. You'll likely want to review the code there, then write your own to do something very similar. That code includes a custom ContentHandler that edits the img tags to rewrite their src attributes; you may or may not need that, depending on where you are going to write the images out.
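To make the src-rewriting idea concrete, here is a minimal sketch of a SAX handler that rewrites the src attribute of every img element as the events stream past. It uses only JDK classes and is not the Alfresco code. The class name ImgSrcRewriter, the rewrite helper, and the "images/" prefix are hypothetical, and the sample markup is hand-written rather than real converter output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: prepends a prefix to the src attribute of each img element.
final class ImgSrcRewriter extends XMLFilterImpl {
    private final String srcPrefix;

    ImgSrcRewriter(XMLReader parent, String srcPrefix) {
        super(parent);
        this.srcPrefix = srcPrefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String name = localName.isEmpty() ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            // Attributes is read-only, so copy it before changing the value.
            AttributesImpl copy = new AttributesImpl(atts);
            int index = copy.getIndex("src");
            if (index >= 0) {
                copy.setValue(index, srcPrefix + copy.getValue(index));
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    /** Parses well-formed XHTML, applies the filter, and serializes the result. */
    static String rewrite(String xhtml, String srcPrefix) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();

            // An identity transform copies the filtered SAX events to the output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(new ImgSrcRewriter(parser, srcPrefix),
                              new InputSource(new StringReader(xhtml))),
                new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>text</p><img src=\"image1.png\"/></body></html>";
        System.out.println(rewrite(xhtml, "images/"));
    }
}
```

Because XMLFilterImpl is itself a ContentHandler, the same startElement override could in principle sit in front of any handler that receives the generated HTML events; the parser and identity transform here only exist to make the sketch runnable on its own.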
