Encoding issue with Apache POI converter


Problem description

I have an MS Word .doc file that I'm converting to an HTML document using Apache POI.

This is the code I'm running:

    InputStream input = new FileInputStream (path);
    HWPFDocument wordDocument = new HWPFDocument (input);            
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter (DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument() );

    List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
    if (pics != null) 
    {
        for (int i = 0; i <pics.size(); i++) 
        {
            Picture pic = (Picture) pics.get (i);
            try 
            {
                pic.writeImageContent (new FileOutputStream (path + pic.hashCode() + '.' + pic.suggestFileExtension()) );
            }
            catch (FileNotFoundException e) 
            {
                e.printStackTrace();
            }
        }
    }

    wordToHtmlConverter.setPicturesManager (new PicturesManager() 
    {               
        public String savePicture (byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) 
        {
            for(Picture picName:pics)
            {
                return Integer.toString(picName.hashCode()) + '.' + picName.suggestFileExtension();
            }

            return null;
        }
    });

    wordToHtmlConverter.processDocument(wordDocument);                       
    Document htmlDocument = wordToHtmlConverter.getDocument();                        
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult (outStream);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty (OutputKeys.ENCODING, "gbk");
    serializer.setOutputProperty (OutputKeys.INDENT, "yes");
    serializer.setOutputProperty (OutputKeys.METHOD, "html");
    serializer.transform (domSource, streamResult);
    outStream.close();

    String html = new String (outStream.toByteArray());

The code works fine: it preserves images and styles. However, some characters in the HTML do not seem to be encoded properly. For instance, some of the bullet point styles in the original .doc file are not output correctly. I've tried multiple character sets (ASCII, UTF-8, gbk ...), and none of them produces the bullet points correctly.

I'm 99 percent sure the bullets are showing gibberish because of the encoding. Has anyone come across a problem like this with Apache POI?

Recommended answer

This is not an encoding problem but a font problem. Word uses ANSI codes and special fonts for its default bullet lists. For example, the first bullet point is a bullet from the font "Symbol", the second is a circle from the font "Courier New", and the third is a square from the font "Wingdings".
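To see which characters a bullet actually contains, it can help to print its code points. The following is a minimal sketch in plain Java; `BulletInspector` and its methods are hypothetical helpers, not part of Apache POI, and the sample value U+F0B7 is assumed from the replacement used in the answer's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Apache POI): lists the code points of a
// bullet string so that private-use values stand out.
class BulletInspector {

    /** Formats each code point of the text as a U+ hex label. */
    static List<String> codePoints(String text) {
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    /** True if the code point is a private-use character (for example U+E000..U+F8FF). */
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // Assumed sample: the value the answer's code replaces for the Symbol bullet
        String symbolBullet = "\uF0B7";
        System.out.println(codePoints(symbolBullet));
        System.out.println(isPrivateUse(symbolBullet.codePointAt(0)));
    }
}
```

A private-use code point has no standard glyph, so a browser can only render it if the matching symbol font is applied. That is why changing the output charset alone does not help.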

So the easiest option is to replace the ANSI codes of the bullet texts with Unicode characters. With that done, we can use UTF-8 for the HTML.
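The replacement step can be sketched on its own, without any POI dependency. `BulletMapper` below is a hypothetical helper, and the two mappings are taken from the answer's code rather than from a complete table of Word's symbol fonts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Apache POI): maps the code points used by
// Word's default bullets to Unicode characters that ordinary fonts can render.
class BulletMapper {

    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\uF0B7', "\u2022"); // Symbol bullet    -> BULLET
        REPLACEMENTS.put('\uF0A7', "\u25AA"); // Wingdings square -> BLACK SMALL SQUARE
    }

    /** Returns the bullet text with every known symbol-font character replaced. */
    static String toUnicode(String bulletText) {
        if (bulletText == null || bulletText.isEmpty()) {
            return bulletText;
        }
        StringBuilder sb = new StringBuilder(bulletText.length());
        for (char c : bulletText.toCharArray()) {
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c); // leave unknown characters untouched
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("\uF0B7"));
    }
}
```

The Courier New circle is deliberately left out of this table: it is the plain letter "o", so mapping it globally would also change any other "o" in the bullet text.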

Example:

Word document (WordBulletList.doc): [image not available]

Java:

import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.File;
import java.io.PrintWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;

import org.w3c.dom.Document;

import java.awt.Desktop;

public class TestWordToHtmlConverter {

 public static void main(String[] args) throws Exception {

  Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

  WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {

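   @Override
   protected void processParagraph(HWPFDocumentCore hwpfDocument,
                                   org.w3c.dom.Element parentElement,
                                   int currentTableLevel,
                                   Paragraph paragraph,
                                   java.lang.String bulletText) {
    // Only rewrite when the paragraph actually has bullet text
    if (bulletText != null && !bulletText.isEmpty()) {
     // Uncomment to inspect the code of an unknown bullet character:
     //System.out.println((int)bulletText.charAt(0));
     // Symbol bullet -> BULLET
     bulletText = bulletText.replace("\uF0B7", "\u2022");
     // Courier New "o" -> indented MEDIUM WHITE CIRCLE (replaces every 'o' in the bullet text)
     bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
     // Wingdings square -> further indented BLACK SMALL SQUARE
     bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");
    }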

    super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
   }

  };

  // Close the input stream once the document has been read
  try (FileInputStream in = new FileInputStream("WordBulletList.doc")) {
   wordToHtmlConverter.processDocument(new HWPFDocument(in));
  }

  StringWriter stringWriter = new StringWriter();
  Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
  transformer.setOutputProperty( OutputKeys.ENCODING, "utf-8" );
  transformer.setOutputProperty( OutputKeys.METHOD, "html" );
  transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

  String html = stringWriter.toString();

  // Write the file in the same charset that the serialized HTML declares
  try(PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
    out.println(html);
  }

  File htmlFile = new File("WordBulletList.html");
  Desktop.getDesktop().browse(htmlFile.toURI());

 }
}

HTML:

...
<body class="b1 b2">
<p class="p1">
<span>Word bullet list:</span>
</p>
<p class="p2">
<span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;&nbsp;&nbsp;▪​&nbsp;</span><span>Bullet3</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
</p>
<p class="p1">
<span>End</span>
</p>
</body>
...
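Separately from the font issue, the question's code calls `new String(outStream.toByteArray())`, which decodes with the platform default charset even though the serializer was told to write "gbk". The sketch below, with a hypothetical `HtmlBytes` helper, shows one way to keep both sides on the same charset; it assumes UTF-8, as the answer recommends.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes and decodes HTML text with one explicit charset,
// so the result does not depend on the platform default.
class HtmlBytes {

    static byte[] encode(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = "<span>\u2022 Bullet1</span>";
        // Round trip: the decoded text equals the original
        System.out.println(decode(encode(html)).equals(html));
    }
}
```

The same rule applies to the question's `ByteArrayOutputStream`: whatever value is set for `OutputKeys.ENCODING` should also be the charset passed to the `String` constructor.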
