Encoding issue with apache poi converter
Problem description
I have an MS Word .doc file that I am converting to an HTML document using Apache POI.
This is the code I am running:
InputStream input = new FileInputStream(path);
HWPFDocument wordDocument = new HWPFDocument(input);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());

// Extract every embedded picture next to the source file.
List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
if (pics != null) {
    for (int i = 0; i < pics.size(); i++) {
        Picture pic = (Picture) pics.get(i);
        try {
            pic.writeImageContent(new FileOutputStream(
                    path + pic.hashCode() + '.' + pic.suggestFileExtension()));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

// Tell the converter which file name to reference for each picture.
wordToHtmlConverter.setPicturesManager(new PicturesManager() {
    public String savePicture(byte[] content, PictureType pictureType,
            String suggestedName, float widthInches, float heightInches) {
        for (Picture picName : pics) {
            return Integer.toString(picName.hashCode()) + '.' + picName.suggestFileExtension();
        }
        return null;
    }
});

wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();

// Serialize the DOM to HTML.
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(outStream);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "gbk");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
outStream.close();
String html = new String(outStream.toByteArray());
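The code works fine and preserves the images and styles. However, some characters in the HTML do not come out correctly. For example, some of the bullet styles from the original .doc file are not rendered properly. I have tried several charsets (ASCII, UTF-8, gbk ...) and none of them produced the bullets correctly.
I am 99% sure the bullets are garbled because of the encoding. Has anyone run into this problem with Apache POI?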
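This is not an encoding problem but a font problem. Word uses ANSI codes and special fonts for its default bullet lists. For example, the first bullet point is a bullet from the font "Symbol", the second bullet point is a circle from the font "Courier New", and the third bullet point is a square from the font "Wingdings".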
So the simplest option is to replace the ANSI codes in the bullet text with their Unicode equivalents. Once that is done, we can use UTF-8 for the HTML.
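To find out which codes need replacing in a given document, it can help to print the code points of the bullet text first. The following is only a minimal sketch: the class `BulletInspector` and its `describe` method are hypothetical helper names, not part of Apache POI, and it assumes that Symbol and Wingdings bullets arrive as private-use characters such as U+F0B7, which is what the values in the example suggest.

```java
public class BulletInspector {

    // Lists every code point of the text as U+XXXX and flags private-use
    // characters, which have no standard glyph outside their original font.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            if (Character.getType(cp) == Character.PRIVATE_USE) {
                sb.append("(private use)");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\uF0B7")); // Symbol bullet as seen in the example
        System.out.println(describe("o"));      // the Courier New "circle" is a plain letter o
    }
}
```

Any character flagged as private use is a candidate for replacement with a real Unicode symbol.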
Example:
Word WordBulletList.doc:
Java:
import java.awt.Desktop;
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TestWordToHtmlConverter {

    public static void main(String[] args) throws Exception {

        Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {
            @Override
            protected void processParagraph(HWPFDocumentCore hwpfDocument, Element parentElement,
                    int currentTableLevel, Paragraph paragraph, String bulletText) {
                // Compare string content, not references (the original used != "").
                if (bulletText != null && !bulletText.isEmpty()) {
                    //System.out.println((int) bulletText.charAt(0));
                    // Bullet from font "Symbol" -> U+2022 BULLET
                    bulletText = bulletText.replace("\uF0B7", "\u2022");
                    // Letter "o" from font "Courier New" -> indented U+26AA circle.
                    // Note: this replaces every "o" in the bullet text.
                    bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
                    // Square from font "Wingdings" -> further indented U+25AA small square
                    bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");
                }
                super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
            }
        };

        wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));

        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

        String html = stringWriter.toString();

        // Write the file as UTF-8 so the bytes match the encoding set on the transformer.
        try (PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
            out.println(html);
        }

        File htmlFile = new File("WordBulletList.html");
        Desktop.getDesktop().browse(htmlFile.toURI());
    }
}
HTML:
...
<body class="b1 b2">
<p class="p1"><span>Word bullet list:</span></p>
<p class="p2"><span class="s1">• </span><span>Bullet1</span></p>
<p class="p2"><span class="s1"> ⚪ </span><span>Bullet2</span></p>
<p class="p2"><span class="s1"> ▪ </span><span>Bullet3</span></p>
<p class="p2"><span class="s1"> ⚪ </span><span>Bullet2</span></p>
<p class="p2"><span class="s1">• </span><span>Bullet1</span></p>
<p class="p1"><span>End</span></p>
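The replacement step in the Java example can also be pulled out into a small lookup table, so the mapping can be tested without a Word file. This is only a sketch under a few assumptions: `BulletText` and `normalize` are hypothetical names, not POI API; the non-breaking-space indentation from the example is left out for brevity; and, unlike the example, a plain "o" is only replaced when it is the whole bullet text (ignoring surrounding whitespace), so that an "o" inside other list labels is left alone. That last rule is a choice made here, not something stated in the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BulletText {

    // Private-use codes seen in the example, mapped to real Unicode symbols.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\uF0B7", "\u2022"); // Symbol bullet -> BULLET
        REPLACEMENTS.put("\uF0A7", "\u25AA"); // Wingdings square -> BLACK SMALL SQUARE
    }

    static String normalize(String bulletText) {
        if (bulletText == null || bulletText.isEmpty()) {
            return bulletText;
        }
        String result = bulletText;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        // Courier New "circle": only treat a lone letter o as a bullet.
        if (result.trim().equals("o")) {
            result = result.replace("o", "\u26AA");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("\uF0B7"));
        System.out.println(normalize("o"));
        System.out.println(normalize("\uF0A7"));
    }
}
```

Inside an overridden `processParagraph`, the three `replace` calls could then be swapped for a single `bulletText = BulletText.normalize(bulletText);`.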