Converting .doc to .html in Java using Apache POI

Problem description

I want to convert a .doc document that contains some images. How can I convert it to *.html so that the images stay in the same positions? And how can I store those images in a separate folder named image and use that folder as the image source?

My code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

public class TestWordToHtmlConverter {
    private final File docFile;

    public TestWordToHtmlConverter(File docFile) {
        this.docFile = docFile;
    }

    // Converts docFile to HTML and writes the result to htmlFile.
    public void convert(File htmlFile) {
        // try-with-resources closes the input stream even if conversion fails
        try (FileInputStream finStream = new FileInputStream(docFile)) {
            HWPFDocument doc = new HWPFDocument(finStream);

            // WordToHtmlConverter builds the HTML as a DOM tree inside newDocument
            Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
            wordToHtmlConverter.processDocument(doc);

            // Serialize the DOM tree to an HTML string
            StringWriter stringWriter = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

            String html = stringWriter.toString();

            // Write the HTML as UTF-8; the target directory must already exist
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(htmlFile), StandardCharsets.UTF_8))) {
                out.write(html);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        TestWordToHtmlConverter converter = new TestWordToHtmlConverter(new File("docx/sample.doc"));
        converter.convert(new File("html/sample.html"));
    }
}

This implementation doesn't create images or links to them. This can be changed by overriding the AbstractWordConverter.processImage(Element, boolean, Picture) method.

Solution

As the API docs say:

WordToHtmlConverter doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.
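
Whichever hook ends up receiving the picture, the work it has to do is the same: write the image bytes into the image folder and add an img element whose src points at that file. The following is only a rough sketch of that part, using nothing beyond the JDK. ImageStore, save and appendImg are hypothetical names, not part of Apache POI; inside a processImage override you would presumably feed them the Picture's bytes and suggested file name (POI's Picture class has getContent() and suggestFullFileName() for this, as far as I know) together with the Element that is passed in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Apache POI.
public class ImageStore {

    // Writes the picture bytes to <htmlDir>/image/<fileName> and returns
    // the relative path to use as the src attribute of the img element.
    public static String save(Path htmlDir, String fileName, byte[] content) {
        try {
            Path imageDir = htmlDir.resolve("image");
            Files.createDirectories(imageDir);
            // Keep only the last path segment so the file stays inside the image folder
            String safeName = Paths.get(fileName).getFileName().toString();
            Files.write(imageDir.resolve(safeName), content);
            return "image/" + safeName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends <img src="..."> to the block element currently being built.
    public static Element appendImg(Element currentBlock, String src) {
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", src);
        currentBlock.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the HTML output folder and the paragraph being converted
        Path htmlDir = Paths.get(System.getProperty("java.io.tmpdir"), "poi-html-demo");
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element paragraph = dom.createElement("p");
        dom.appendChild(paragraph);

        String src = save(htmlDir, "picture1.png", new byte[] {1, 2, 3});
        Element img = appendImg(paragraph, src);
        System.out.println(img.getAttribute("src")); // prints image/picture1.png
    }
}
```

Because the img element is appended to the block that is being processed at that moment, the image ends up in the same place in the HTML flow as in the Word text. Exact positioning of floating images is a separate matter that this sketch does not address.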

You can find out how to override the method here:

You can try the DOCX 2 XHTML converter based on Apache POI XWPF:

You can also use Apache Tika, which is built on top of Apache POI. An example included in Alfresco can be found here:

There are also many other converters.
