How to read doc and docx files in Java with the POI API
Problem description
I am trying to read doc and docx files. Here is the code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ReadFile {
    static String distination = "E:\\";
    static String docFileName = "Requirements.docx";

    public static void main(String[] args) throws IOException {
        ReadFile rf = new ReadFile();
        rf.ReadFileParagraph(distination + docFileName);
    }

    // Picks the POI reader that matches the file extension.
    public void ReadFileParagraph(String path) throws IOException {
        File file = new File(path);
        String filename = file.getName();
        String fileExtension = fileExtension(path);
        // try-with-resources closes the stream even if POI throws
        try (FileInputStream fis = new FileInputStream(file)) {
            if (fileExtension.equals("doc")) {
                // Binary Word format (HWPF classes)
                HWPFDocument document = new HWPFDocument(fis);
                WordExtractor DocExtractor = new WordExtractor(document);
                ReadDocFile(DocExtractor, filename);
            } else if (fileExtension.equals("docx")) {
                // OOXML Word format (XWPF classes)
                XWPFDocument documentX = new XWPFDocument(fis);
                List<XWPFParagraph> pera = documentX.getParagraphs();
                ReadDocXFile(pera, filename);
            } else {
                System.out.println("format does not match");
            }
        }
    }

    // Prints every paragraph of a .doc file.
    public void ReadDocFile(WordExtractor extractor, String filename) {
        for (String paragraph : extractor.getParagraphText()) {
            System.out.println("Paragraph: " + paragraph);
        }
    }

    // Prints every paragraph of a .docx file.
    public void ReadDocXFile(List<XWPFParagraph> extractor, String filename) {
        for (XWPFParagraph paragraph : extractor) {
            System.out.println("Question: " + paragraph.getParagraphText());
        }
    }

    // Returns the text after the last dot in the file name.
    public String fileExtension(String filename) {
        return filename.substring(filename.lastIndexOf(".") + 1);
    }
}
This code throws an exception when I try to read a docx file:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 2 more
Java Result: 1
Another problem is with doc files: some of them are read without trouble, but for others I get an exception like this:
Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The document is too old - Word 95 or older. Try HWPFOldDocument instead?
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1
I saw at http://poi.apache.org/hwpf/index.html that the POI API supports Word 6 and Word 95. Can anybody give a solution to these two problems?
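Problem 1: the NoClassDefFoundError for org/apache/xmlbeans/XmlException means the XMLBeans jar is missing from the classpath. With Maven, declaring the core POI dependencies solves this, because poi-ooxml pulls in XMLBeans transitively: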
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.15</version>
</dependency>
<!-- For .docx files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<!-- For .doc files; keep the same version as the other POI artifacts -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
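To confirm that the dependencies actually reached the runtime classpath, a small check like the following may help. It is only a sketch: the class name ClasspathCheck and the method missingClasses are made up for illustration, and the three class names it probes are the ones that appear in the question's code and stack trace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of POI.
public class ClasspathCheck {

    /** Returns the subset of the given class names that cannot be loaded. */
    public static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(
                "org.apache.xmlbeans.XmlException",           // class named in the stack trace
                "org.apache.poi.xwpf.usermodel.XWPFDocument", // used for .docx files
                "org.apache.poi.hwpf.HWPFDocument");          // used for .doc files
        System.out.println(missing.isEmpty()
                ? "All required classes found"
                : "Missing from classpath: " + missing);
    }
}
```

If any name is reported as missing, the corresponding jar is not on the classpath that the program is started with, even if it is listed in the build file.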
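Problem 2: judging from the original source code, HWPFDocument rejects documents that are too old (Word 95 or older). Its constructor throws OldWordFileFormatException, and the message itself suggests trying HWPFOldDocument instead: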
/**
 * This constructor loads a Word document from a specific point
 * in a POIFSFileSystem, probably not the default.
 * Used typically to open embeded documents.
 *
 * @param directory The DirectoryNode that contains the Word document.
 * @throws IOException If there is an unexpected IOException from the passed
 * in POIFSFileSystem.
 */
public HWPFDocument(DirectoryNode directory) throws IOException
{
    // Load the main stream and FIB
    // Also handles HPSF bits
    super(directory);
    // Is this document too old for us?
    if(_fib.getFibBase().getNFib() < 106) {
        throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
    }
    // ... (rest of the constructor not shown)
}
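Following the hint in the exception message, one possible workaround is to catch OldWordFileFormatException and fall back to the extractor for old documents. The sketch below assumes poi and poi-scratchpad are on the classpath; DocTextReader and readDocText are hypothetical names chosen for illustration, and Word6Extractor offers plain text extraction only, so the result may be less complete than for newer files.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.OldWordFileFormatException;
import org.apache.poi.hwpf.extractor.Word6Extractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical helper, not part of POI.
public class DocTextReader {

    /** Returns the text of a .doc file, falling back to the Word 6 / Word 95 extractor. */
    public static String readDocText(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            // Read the OLE2 container once so both attempts can share it;
            // the stream itself cannot be re-read after a failed attempt.
            POIFSFileSystem fs = new POIFSFileSystem(in);
            try {
                // Word 97 and newer
                return new WordExtractor(new HWPFDocument(fs)).getText();
            } catch (OldWordFileFormatException e) {
                // Word 6 / Word 95: text extraction only
                return new Word6Extractor(fs).getText();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The path is passed on the command line, e.g. a .doc file that triggered the exception
        System.out.println(readDocText(new File(args[0])));
    }
}
```

The container is opened once because the constructor that throws has already consumed the input stream by the time the exception is raised.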