I need Apache POI pictures converted from a Word document to an HTML file
Question
I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document, but I need to turn that image data into HTML so I can write it out to an HTML file. Any hints or suggestions would be appreciated. Keep in mind that I am a desktop developer, not a web programmer, so please bear that in mind when making suggestions. The code below gets the images.
private void parseWordText(File file) throws IOException {
    // Close the stream once the document has been read into memory.
    try (FileInputStream fs = new FileInputStream(file)) {
        doc = new HWPFDocument(fs);
    }
    PicturesTable picTable = doc.getPicturesTable();
    if (picTable != null) {
        picList = new ArrayList<Picture>(picTable.getAllPictures());
        if (!picList.isEmpty()) {
            for (Picture pic : picList) {
                byte[] byteArray = pic.getContent(); // raw image bytes
                pic.suggestFileExtension();
                pic.suggestFullFileName();
                pic.suggestPictureType();
                pic.getStartOffset();
            }
        }
    }
}
The code below then converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in this code?
private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
        return; // nothing to convert if the document could not be loaded
    }
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    // Unused: Document nodes have no attributes, so this is null.
    NamedNodeMap node = htmlDocument.getAttributes();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();
    // Decode with the same charset the serializer was told to write.
    String result = new String(out.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);
    acDocTextArea.setText(newDocText); // newDocText is not defined in this snippet
    htmlText = result;
}
Recommended answer
Looking at the source code of org.apache.poi.hwpf.converter.WordToHtmlConverter at
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740
the JavaDoc states:
This implementation doesn't create images or links to them. This can be changed by overriding {@link #processImage(Element, boolean, Picture)} method.
If you look at the processImage(...) method in AbstractWordConverter.java at line 790, it appears to call another method named processImageWithoutPicturesManager(...).
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740
That method is in turn defined in WordToHtmlConverter, and it looks like exactly the place where you want to add your code (line 317):
@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
        boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
            .createComment("Image link to '"
                    + picture.suggestFullFileName() + "' can be here"));
}
I think this gives you the point at which to start inserting the images into the flow.
Create a subclass of the converter, e.g.
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter
and then override the method and put whatever code you need into it.
I haven't tested it, but from what I can see it should be the right approach in theory.
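As a rough illustration of what the overridden method could do, here is a minimal sketch that embeds image bytes directly in the HTML as a base64 data URI. It uses only the JDK, so it can be tried without POI. The class InlineImages and the methods mimeTypeFor and appendInlineImage are hypothetical names made up for this example and are not part of Apache POI. The extension-to-MIME mapping is a simplification.

```java
import java.util.Base64;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper (not part of Apache POI): embeds image bytes as a data URI. */
final class InlineImages {

    /** Maps a file extension such as "jpg" or "png" to a MIME type (simplified assumption). */
    public static String mimeTypeFor(String extension) {
        String ext = extension.toLowerCase(Locale.ROOT);
        return "jpg".equals(ext) ? "image/jpeg" : "image/" + ext;
    }

    /** Appends an img element with a data URI to the given block and returns it. */
    public static Element appendInlineImage(Element block, byte[] content, String extension) {
        String src = "data:" + mimeTypeFor(extension) + ";base64,"
                + Base64.getEncoder().encodeToString(content);
        // Create the element from the block's own document, so no converter internals are needed.
        Element img = block.getOwnerDocument().createElement("img");
        img.setAttribute("src", src);
        block.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        Document html = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element p = html.createElement("p");
        html.appendChild(p);
        byte[] fakeImage = {1, 2, 3}; // stand-in for the bytes returned by pic.getContent()
        Element img = appendInlineImage(p, fakeImage, "png");
        System.out.println(img.getAttribute("src")); // prints data:image/png;base64,AQID
    }
}
```

Inside an overridden processImageWithoutPicturesManager(currentBlock, inlined, picture), the call would presumably be appendInlineImage(currentBlock, picture.getContent(), picture.suggestFileExtension()). I have not tested this against POI itself. Formats that browsers cannot display, such as EMF or WMF, would still need converting first. An alternative is to write each image to disk and set src to the relative file name, which keeps the HTML file smaller.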