How to judge whether a file is doc or docx in POI
Problem description
The title may be a little confusing. The simplest approach is to judge by the file extension, like this:
// `is` is the InputStream being read
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if (filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}
This works in most cases. However, I have found that some files with the extension doc are really docx files: if you open one with WinRAR, you will find xml files inside. As is well known, a docx file is a zip archive made up of xml files.
I believe this problem cannot be rare, but I have not found any information about it. Clearly, choosing between the doc reader and the docx reader by extension is not reliable.
In my case, I have to read a lot of files, including doc or docx files inside compressed archives (zip, 7z, or even rar). Hence, I have to read the content from an InputStream rather than from a File or anything else. So the approach in "How to know whether a file is .docx or .doc format from Apache POI", which relies on ZipInputStream, is totally unsuitable for my case.
What is the best way to judge whether a file is doc or docx? I want a solution that reads the content of a file that may be either doc or docx, not one that merely tells me which of the two it is. Apparently, ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to rely on an exception to tell whether a file is doc or docx?
Recommended answer
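With Apache POI 3.17 (the current stable version at the time of writing), you can use FileMagic. Internally, of course, this also has to look into the file: it peeks at the first bytes of the stream, so the stream must support mark/reset (a BufferedInputStream does).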
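To give a rough idea of what "looking into the file" can mean, here is a minimal sketch in plain JDK Java that peeks at the leading bytes of a stream and then rewinds it. It is not POI's implementation; `WordFormatSniffer`, `Kind`, `markable` and `sniff` are hypothetical names introduced only for this illustration. It relies on two well-known signatures: OLE2 compound files (binary doc) begin with the bytes D0 CF 11 E0 A1 B1 1A E1, and zip archives (hence docx) begin with "PK" followed by 03 04.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of Apache POI: guesses the container format from the leading bytes. */
final class WordFormatSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound documents (binary .doc) start with these eight bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Zip archives (and therefore .docx) start with "PK" 0x03 0x04.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Wraps the stream only if it does not already support mark/reset. */
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to eight bytes, then rewinds so the caller can read from the start. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        int n = 0;
        in.mark(head.length);
        try {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        } finally {
            in.reset();
        }
        if (startsWith(head, n, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, n, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = { 'P', 'K', 3, 4, 20, 0, 0, 0 };
        InputStream in = markable(new ByteArrayInputStream(zipLike));
        System.out.println(sniff(in)); // prints ZIP
        System.out.println(in.read()); // prints 80 ('P'): the stream was rewound
    }
}
```

Note that this check only identifies the container: other OLE2 files (xls, ppt) and any ordinary zip archive carry the same signatures, so a match means "worth trying the corresponding extractor", not "certainly a Word document".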
Example:
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;

import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    // The stream passed in must support mark/reset, because FileMagic
    // peeks at the first bytes and then rewinds.
    static String read(InputStream is) throws Exception {
        // Detect the format once and reuse the result.
        FileMagic magic = FileMagic.valueOf(is);
        System.out.println(magic);

        String text = "";
        if (magic == FileMagic.OLE2) {
            // binary Word format
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (magic == FileMagic.OOXML) {
            // zip-based Office Open XML format
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); // really a binary OLE2 Word file
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); // an OOXML Word file named *.doc
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); // really an OOXML Word file
        System.out.println(read(is));
        is.close();
    }
}
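For the case raised in the question, where the documents sit inside an archive, one possible approach is to copy each entry into memory first. A ZipInputStream does not support mark/reset, and closing it closes the whole archive, so handing it directly to code that peeks at the stream or closes it is awkward. The sketch below uses only the JDK; `ArchiveEntryBuffers`, `readEntries` and `open` are hypothetical names, and it assumes each entry is small enough to hold in memory. It covers zip only: `java.util.zip` does not read 7z or rar, which would need a third-party library not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: copies each archive entry into memory so it can be sniffed and re-read. */
final class ArchiveEntryBuffers {

    /** Returns entry name -> entry bytes for all non-directory entries, in archive order. */
    static Map<String, byte[]> readEntries(InputStream archive) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                // read() returns -1 at the end of the current entry, not of the archive
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }

    /** Returns a fresh stream that supports mark/reset and can be closed without touching the archive. */
    static InputStream open(byte[] entryBytes) {
        return new ByteArrayInputStream(entryBytes);
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory archive standing in for a real zip of Word files.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(raw)) {
            zos.putNextEntry(new ZipEntry("a.doc"));
            zos.write("placeholder".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        Map<String, byte[]> entries = readEntries(new ByteArrayInputStream(raw.toByteArray()));
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            InputStream in = open(e.getValue());
            System.out.println(e.getName() + " markSupported=" + in.markSupported());
        }
    }
}
```

A stream obtained from `open(...)` could then be passed to a method like `read(InputStream)` above, since a ByteArrayInputStream supports mark/reset and can be closed freely.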