Apache POI HWPF - problem converting a doc file to PDF

Problem description

I am currently working on a Java project that uses Apache POI. In this project I want to convert a doc file to a PDF file. The conversion completes successfully, but the PDF contains only plain text, without any of the text styles or text colours. The PDF looks black and white, while my doc file is coloured and uses several different text styles.

Here is my code:

 POIFSFileSystem fs = null;  
 Document document = new Document(); 

 try {  
     System.out.println("Starting the test");  
     fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));  

     HWPFDocument doc = new HWPFDocument(fs);  
     WordExtractor we = new WordExtractor(doc);  

     OutputStream file = new FileOutputStream(new File("/document/test.pdf")); 

     PdfWriter writer = PdfWriter.getInstance(document, file);  

     Range range = doc.getRange();
     document.open();  
     writer.setPageEmpty(true);  
     document.newPage();  
     writer.setPageEmpty(true);  

     String[] paragraphs = we.getParagraphText();  
     for (int i = 0; i < paragraphs.length; i++) {  

         org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
         // CharacterRun run = pr.getCharacterRun(i);
         // run.setBold(true);
         // run.setCapitalized(true);
         // run.setItalic(true);
         paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
         System.out.println("Length:" + paragraphs[i].length());
         System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());

         // add the paragraph to the document
         document.add(new Paragraph(paragraphs[i]));
     }

     System.out.println("Document testing completed");  
 } catch (Exception e) {  
     System.out.println("Exception during test");  
     e.printStackTrace();  
 } finally {  
     // close the document
     document.close();
 }
 }  

Please help me.

Thanks in advance.

Recommended answer

If you look at Apache Tika, there's a good example of reading some style information from an HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.

The Tika class is:
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph, and other styling is applied to the runs. Depending on which formatting interests you, it may therefore be on the paragraph or on the run.
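
To make the run-level part concrete, here is a minimal sketch of turning one run's raw formatting values into units a PDF library can use. It assumes, as far as I know from POI's own HTML converter, that `CharacterRun.getFontSize()` reports half-points and that `CharacterRun.getIco24()` returns a `0x00BBGGRR` value with `-1` meaning automatic colour. `RunStyleSketch`, `RunStyle` and `toRunStyle` are hypothetical names for illustration, not part of POI or iText.

```java
public class RunStyleSketch {

    /** Hypothetical holder for one run's formatting; not a POI or iText type. */
    static final class RunStyle {
        final boolean bold;
        final boolean italic;
        final float sizeInPoints;
        final int rgb; // packed as 0xRRGGBB

        RunStyle(boolean bold, boolean italic, float sizeInPoints, int rgb) {
            this.bold = bold;
            this.italic = italic;
            this.sizeInPoints = sizeInPoints;
            this.rgb = rgb;
        }
    }

    /**
     * Converts raw run values into PDF-friendly units.
     * halfPoints: assumed to be what CharacterRun.getFontSize() returns.
     * ico24: assumed to be what CharacterRun.getIco24() returns (0x00BBGGRR, -1 = automatic).
     */
    static RunStyle toRunStyle(boolean bold, boolean italic, int halfPoints, int ico24) {
        float sizeInPoints = halfPoints / 2.0f;
        int rgb;
        if (ico24 == -1) {
            rgb = 0x000000; // assumption: render "automatic" colour as black
        } else {
            // Swap the byte order from blue-green-red to red-green-blue.
            int red = ico24 & 0xFF;
            int green = (ico24 >> 8) & 0xFF;
            int blue = (ico24 >> 16) & 0xFF;
            rgb = (red << 16) | (green << 8) | blue;
        }
        return new RunStyle(bold, italic, sizeInPoints, rgb);
    }

    public static void main(String[] args) {
        // Sample values standing in for one bold, red, 12pt run.
        RunStyle s = toRunStyle(true, false, 24, 0x0000FF);
        System.out.printf("bold=%b italic=%b size=%.1f color=#%06X%n",
                s.bold, s.italic, s.sizeInPoints, s.rgb);
    }
}
```

In the conversion loop you would then iterate the runs of each paragraph with `pr.numCharacterRuns()` and `pr.getCharacterRun(j)`, pass `run.isBold()`, `run.isItalic()`, `run.getFontSize()` and `run.getIco24()` to a helper like this, and add one styled PDF chunk per run using `run.text()` instead of one unstyled paragraph. Note that the run index is separate from the paragraph index, so `pr.getCharacterRun(i)` with the paragraph counter would pick the wrong run.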
