I need Apache POI pictures converted from a Word document to an HTML file
Problem description
I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document, but I need to turn that data into HTML so I can write it out to an HTML file. Any hints or suggestions would be appreciated. Please keep in mind that I am a desktop developer, not a web programmer. The code below gets the images.
private void parseWordText(File file) throws IOException {
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fs = new FileInputStream(file)) {
        doc = new HWPFDocument(fs);
    }
    PicturesTable picTable = doc.getPicturesTable();
    if (picTable != null) {
        picList = new ArrayList<Picture>(picTable.getAllPictures());
        if (!picList.isEmpty()) {
            for (Picture pic : picList) {
                // raw image bytes, plus the metadata POI can suggest for them
                byte[] byteArray = pic.getContent();
                pic.suggestFileExtension();
                pic.suggestFullFileName();
                pic.suggestPictureType();
                pic.getStartOffset();
            }
        }
    }
}
The code below then converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in that code?
private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
        return; // nothing to convert if the document could not be loaded
    }
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    NamedNodeMap node = htmlDocument.getAttributes();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();
    // decode with the same charset the serializer was told to write
    String result = new String(out.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);
    acDocTextArea.setText(newDocText);
    htmlText = result;
}
Take a look at the source code for org.apache.poi.hwpf.converter.WordToHtmlConverter at
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740
The JavaDoc states:
"This implementation doesn't create images or links to them. This can be changed by overriding {@link #processImage(Element, boolean, Picture)} method."
If you take a look at that processImage(...) method in AbstractWordConverter.java at line 790, it looks like it in turn calls another method named processImageWithoutPicturesManager(...).
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740
That method is overridden in WordToHtmlConverter and looks suspiciously like exactly the place where you want to add your code (line 317):
@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
        boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
            .createComment("Image link to '"
                    + picture.suggestFullFileName() + "' can be here"));
}
I think this is the point where you can start inserting the images into the flow.
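To make that more concrete, here is a minimal sketch of what could be appended to currentBlock instead of the comment: an img element whose src is a data: URI, so the image bytes travel inside the HTML file itself. It uses only JDK classes. The class and method names (InlineImageSketch, toDataUri, appendInlineImage) are made up for illustration. The bytes are assumed to come from picture.getContent(), and the MIME type is passed in as a plain string, so how you derive it from the Picture is left open.

```java
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Apache POI.
class InlineImageSketch {

    /** Builds a data: URI that carries the image bytes as Base64 text. */
    public static String toDataUri(byte[] content, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(content);
    }

    /** Appends an img element with a data: URI src to the given block and returns it. */
    public static Element appendInlineImage(Element currentBlock, byte[] content, String mimeType) {
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", toDataUri(content, mimeType));
        currentBlock.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element paragraph = doc.createElement("p");
        doc.appendChild(paragraph);

        byte[] fakeImage = {1, 2, 3}; // stand-in for picture.getContent()
        Element img = appendInlineImage(paragraph, fakeImage, "image/png");
        System.out.println(img.getAttribute("src"));
    }
}
```

Inside an overridden processImageWithoutPicturesManager(...), the call would presumably look like appendInlineImage(currentBlock, picture.getContent(), someMimeType). Inline data makes the HTML file self-contained, but it also makes it larger, since Base64 adds roughly a third to the size of each image.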
Create a subclass of the converter, e.g.
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter
and then override the method and put whatever code you need into it.
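One option for the code inside that override, sketched here with JDK classes only, is to write each picture to a folder next to the HTML file and use the returned relative path as the src of an img element. The names LinkedImageSketch and saveImage are invented for this example. The file name is assumed to come from picture.suggestFullFileName() and the bytes from picture.getContent().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Apache POI.
class LinkedImageSketch {

    /**
     * Writes the picture bytes to htmlDir/imageDirName/fileName and returns
     * the path relative to htmlDir, suitable for an img src attribute.
     */
    public static String saveImage(Path htmlDir, String imageDirName,
                                   String fileName, byte[] content) throws IOException {
        Path imageDir = Files.createDirectories(htmlDir.resolve(imageDirName));
        Files.write(imageDir.resolve(fileName), content);
        // forward slash on purpose: this is a URL fragment, not an OS path
        return imageDirName + "/" + fileName;
    }

    public static void main(String[] args) throws IOException {
        Path htmlDir = Files.createTempDirectory("word-to-html");
        byte[] fakeImage = {1, 2, 3}; // stand-in for picture.getContent()
        String src = saveImage(htmlDir, "images", "0.png", fakeImage);
        System.out.println("<img src=\"" + src + "\">");
    }
}
```

This keeps the HTML small and lets a browser cache the images, at the cost of having to ship the image folder together with the HTML file.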
I haven't tested it, but from what I can see in the source it should be the right approach in theory.