Encoding issue with Apache POI converter


Problem description

I have an MS Word .doc file that I'm converting to an HTML document using Apache POI.

This is the code I'm running:

    InputStream input = new FileInputStream (path);
    HWPFDocument wordDocument = new HWPFDocument (input);            
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter (DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument() );

    List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
    if (pics != null) 
    {
        for (int i = 0; i <pics.size(); i++) 
        {
            Picture pic = (Picture) pics.get (i);
            try 
            {
                pic.writeImageContent (new FileOutputStream (path + pic.hashCode() + '.' + pic.suggestFileExtension()) );
            }
            catch (FileNotFoundException e) 
            {
                e.printStackTrace();
            }
        }
    }

    wordToHtmlConverter.setPicturesManager (new PicturesManager() 
    {               
        public String savePicture (byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) 
        {
            for(Picture picName:pics)
            {
                return Integer.toString(picName.hashCode()) + '.' + picName.suggestFileExtension();
            }

            return null;
        }
    });

    wordToHtmlConverter.processDocument(wordDocument);                       
    Document htmlDocument = wordToHtmlConverter.getDocument();                        
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult (outStream);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty (OutputKeys.ENCODING, "gbk");
    serializer.setOutputProperty (OutputKeys.INDENT, "yes");
    serializer.setOutputProperty (OutputKeys.METHOD, "html");
    serializer.transform (domSource, streamResult);
    outStream.close();

    String html = new String (outStream.toByteArray());

The code works fine; it preserves images and styles. However, some characters in the HTML are not encoded properly. For instance, some of the bullet point styles in the original .doc file are not output correctly. I've tried multiple character sets (ASCII, UTF-8, GBK, ...) and none of them produces the bullet points correctly.

I'm 99 percent sure the bullets are showing gibberish because of the encoding. Has anyone come across a problem like this with Apache POI?

Solution

This is not an encoding problem but a font problem. Word uses ANSI codes and special fonts for its default bullet lists. The first bullet point, for example, is a bullet from the font "Symbol". The second bullet point is a circle (the letter "o") from the font "Courier New". The third bullet point is a square from the font "Wingdings".
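One way to check this, as a rough sketch: dump the code points of a bullet text and see whether they fall in Unicode's Private Use Area (U+E000 to U+F8FF), which is where the values U+F0B7 and U+F0A7 handled in the example further down live. Such code points have no standard glyph, so no output charset can make them display as bullets. The class and method names here are made up for illustration and are not part of Apache POI.

```java
class BulletDiagnostics {

    /** True if the character lies in the Unicode Private Use Area. */
    static boolean isPrivateUse(char c) {
        return Character.getType(c) == Character.PRIVATE_USE;
    }

    /** Formats each character as U+XXXX and flags the private-use ones. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
            if (isPrivateUse(c)) {
                sb.append("(private use)");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical bullet text: a Symbol-font bullet followed by a tab.
        System.out.println(describe("\uF0B7\t"));
    }
}
```

Calling describe(bulletText) from inside a processParagraph override would be one place to use it, in the same spot as the commented-out println in the example.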

So the easiest approach is simply to replace the ANSI codes of the bullet texts with Unicode characters. Once that is done, we can use UTF-8 for the HTML.
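On the UTF-8 point, whichever charset the serializer is given, the bytes have to be decoded (or written to disk) with that same charset. The question's code serializes as "gbk" but then calls new String(bytes), which uses the JVM's default charset instead. A small sketch of the round trip, with a hypothetical class name and sample string:

```java
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

    /** Decodes bytes that are known to have been produced as UTF-8. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample markup containing U+2022 BULLET and a no-break space.
        String html = "<span>\u2022\u00A0Bullet1</span>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);

        // Same charset on both sides: the text survives.
        System.out.println(decodeUtf8(bytes).equals(html));

        // Mismatched charset: the multi-byte characters come back garbled.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```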

Example:

Word WordBulletList.doc:

Java:

import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.File;
import java.io.PrintWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;

import org.w3c.dom.Document;

import java.awt.Desktop;

public class TestWordToHtmlConverter {

 public static void main(String[] args) throws Exception {

  Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

  WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {

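   @Override
   protected void processParagraph(HWPFDocumentCore hwpfDocument,
                                   org.w3c.dom.Element parentElement,
                                   int currentTableLevel,
                                   Paragraph paragraph,
                                   java.lang.String bulletText) {
    // Compare by content; != on strings only compares references.
    if (bulletText != null && !bulletText.isEmpty()) {
     //System.out.println((int)bulletText.charAt(0));
     // Symbol-font bullet (private-use U+F0B7) -> U+2022 BULLET
     bulletText = bulletText.replace("\uF0B7", "\u2022");
     // Courier New letter "o" -> indented U+26AA circle (note: replaces every "o" in the bullet text)
     bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
     // Wingdings square (private-use U+F0A7) -> further indented U+25AA small square
     bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");
    }

    super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
   }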

  };

  wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));

  StringWriter stringWriter = new StringWriter();
  Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
  transformer.setOutputProperty( OutputKeys.ENCODING, "utf-8" );
  transformer.setOutputProperty( OutputKeys.METHOD, "html" );
  transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

  String html = stringWriter.toString();

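  // Write the file as UTF-8 so it matches the encoding set on the transformer,
  // instead of relying on the platform default charset.
  try(PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
    out.println(html);
  }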

  File htmlFile = new File("WordBulletList.html");
  Desktop.getDesktop().browse(htmlFile.toURI());

 }
}

HTML:

...
<body class="b1 b2">
<p class="p1">
<span>Word bullet list:</span>
</p>
<p class="p2">
<span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;&nbsp;&nbsp;▪​&nbsp;</span><span>Bullet3</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
</p>
<p class="p1">
<span>End</span>
</p>
</body>
...
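
The three chained replace calls in the Java example can also be written as a lookup table that is applied only to the leading bullet character, so that an ordinary letter "o" elsewhere in a label is left alone. This is a minimal sketch, not part of Apache POI: BulletNormalizer and normalize are hypothetical names, the mappings are the ones from the example, and it assumes the bullet glyph is always the first character of the bullet text.

```java
import java.util.HashMap;
import java.util.Map;

class BulletNormalizer {

    // Leading bullet character -> replacement text (same values as in the example).
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\uF0B7', "\u2022");
        REPLACEMENTS.put('o', "\u00A0\u00A0\u26AA");
        REPLACEMENTS.put('\uF0A7', "\u00A0\u00A0\u00A0\u00A0\u25AA");
    }

    /** Replaces a known leading bullet character; any other text is returned unchanged. */
    static String normalize(String bulletText) {
        if (bulletText == null || bulletText.isEmpty()) {
            return bulletText;
        }
        String replacement = REPLACEMENTS.get(bulletText.charAt(0));
        if (replacement == null) {
            return bulletText;
        }
        return replacement + bulletText.substring(1);
    }
}
```

Inside the processParagraph override, the call would be bulletText = BulletNormalizer.normalize(bulletText); before delegating to super. Checking the list's font as well would be stricter, but that is not attempted here.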
