Apache POI HWPF - problem converting a doc file to PDF
I am currently working on a Java project that uses Apache POI. In this project I want to convert a .doc file to a PDF file. The conversion completes successfully, but the PDF contains only the plain text, without any text styles or colours. The PDF looks black and white, while my .doc file is coloured and uses several different text styles.
Here is my code:
POIFSFileSystem fs = null;
Document document = new Document();
try {
    System.out.println("Starting the test");
    fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
    PdfWriter writer = PdfWriter.getInstance(document, file);
    Range range = doc.getRange();
    document.open();
    writer.setPageEmpty(true);
    document.newPage();
    writer.setPageEmpty(true);
    String[] paragraphs = we.getParagraphText();
    for (int i = 0; i < paragraphs.length; i++) {
        org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
        // CharacterRun run = pr.getCharacterRun(i);
        // run.setBold(true);
        // run.setCapitalized(true);
        // run.setItalic(true);
        paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
        System.out.println("Length:" + paragraphs[i].length());
        System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
        // add the paragraph to the document (plain text only, no formatting)
        document.add(new Paragraph(paragraphs[i]));
    }
    System.out.println("Document testing completed");
} catch (Exception e) {
    System.out.println("Exception during test");
    e.printStackTrace();
} finally {
    // close the document
    document.close();
}
} // end of the enclosing method (its declaration was not posted)
Please help me.
Thanks in advance.
If you look at Apache Tika, there's a good example of reading some style information from an HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.
The Tika class is https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph as a whole, and other styling is applied to the individual runs. Depending on which formatting interests you, you may therefore need to read it from the paragraph or from the run.
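To make the paragraph/run structure concrete, here is a minimal sketch in plain Java that emits one styled piece of output per run instead of one per paragraph, roughly in the spirit of the HTML that Tika generates. `StyledRun` and `RunStyleDemo` are hypothetical names invented for this illustration and are not part of POI or Tika. The comment inside the class shows how the list might be filled from HWPF using `numCharacterRuns()`, `getCharacterRun(int)`, `text()`, `isBold()` and `isItalic()`; how a run's colour value is turned into a usable colour is left as an assumption to check against the POI documentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the formatting carried by one HWPF CharacterRun.
final class StyledRun {
    final String text;
    final boolean bold;
    final boolean italic;
    final String cssColor; // e.g. "#ff0000", or null for the default colour

    StyledRun(String text, boolean bold, boolean italic, String cssColor) {
        this.text = text;
        this.bold = bold;
        this.italic = italic;
        this.cssColor = cssColor;
    }
}

class RunStyleDemo {
    // With POI, the runs of paragraph i could be collected roughly like this
    // (untested sketch; the colour mapping is deliberately left out):
    //   org.apache.poi.hwpf.usermodel.Paragraph p = range.getParagraph(i);
    //   for (int j = 0; j < p.numCharacterRuns(); j++) {
    //       CharacterRun r = p.getCharacterRun(j);
    //       runs.add(new StyledRun(r.text(), r.isBold(), r.isItalic(), null));
    //   }

    // Renders one paragraph: every run becomes its own styled fragment.
    static String toHtml(List<StyledRun> runs) {
        StringBuilder sb = new StringBuilder("<p>");
        for (StyledRun r : runs) {
            String piece = escape(r.text);
            if (r.italic) piece = "<i>" + piece + "</i>";
            if (r.bold) piece = "<b>" + piece + "</b>";
            if (r.cssColor != null) {
                piece = "<span style=\"color:" + r.cssColor + "\">" + piece + "</span>";
            }
            sb.append(piece);
        }
        return sb.append("</p>").toString();
    }

    // Escapes the characters that are special in HTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<StyledRun> runs = new ArrayList<>();
        runs.add(new StyledRun("Plain, ", false, false, null));
        runs.add(new StyledRun("bold red", true, false, "#ff0000"));
        runs.add(new StyledRun(" & italic", false, true, null));
        System.out.println(toHtml(runs));
    }
}
```

For the PDF case in the question, the same per-run loop would presumably build one styled text fragment per run and add those fragments to a single PDF paragraph, rather than adding the whole paragraph text as one unstyled string.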