How to determine whether a file is doc or docx in Apache POI


Question

The title may be a little confusing. The simplest approach is to judge by the file extension, like this:

// "is" is the InputStream of the file
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if(filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}

This works in most cases. However, I have found files whose extension is doc but which are really docx files: if you open one with WinRAR, you find XML files inside. A docx file is, after all, a zip archive made up of XML files. I doubt this problem is rare, but I have not found any information about it. Clearly, choosing how to read a doc or docx by its extension is not reliable.
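The difference between the two containers shows up in the first few bytes of the file. The sketch below uses only the JDK, not POI, and compares a header against the OLE2 compound-file signature and the zip local-file-header signature. The class and method names are hypothetical. A zip signature only shows that the file is a zip archive; it could equally be an xlsx, a pptx or an ordinary zip, so treat the result as a hint rather than proof that the file is a docx.

```java
import java.util.Arrays;

final class WordFormatSniffer {

    // First 8 bytes of an OLE2 compound file (binary .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // First 4 bytes of a zip local file header ("PK" 03 04); OOXML files are zips.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classifies a header as "OLE2", "ZIP" or "UNKNOWN".
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return "OLE2";
        if (startsWith(header, ZIP_MAGIC)) return "ZIP";
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // Header of a file named *.doc that is really a zip-based document.
        byte[] header = { 'P', 'K', 3, 4, 20, 0, 6, 0 };
        System.out.println(sniff(header));
    }
}
```

The file name plays no part in the check, which is why a docx saved under a .doc name is still classified by its content.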

In my case, I have to read a lot of files, including doc and docx files inside compressed archives (zip, 7z or even rar). So I have to read the content from an InputStream rather than from a File or anything else. That is why the approach in "how to know whether a file is .docx or .doc format from Apache POI" does not suit my case, where the data comes from a ZipInputStream.
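One way to handle documents nested inside an archive is to copy each entry into memory first, so that every document gets its own independent, rewindable stream. The sketch below does this for zip archives with the JDK only; 7z and rar need third-party libraries and are not covered. The class and method names are hypothetical, and buffering whole entries assumes the documents fit comfortably in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ArchiveEntryBuffer {

    // Reads every non-directory entry of a zip stream into memory, keyed by entry name.
    static Map<String, byte[]> readEntries(InputStream archive) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                // read() returns -1 at the end of the current entry, not of the archive
                while ((n = zin.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }

    // Builds a tiny in-memory zip with one fake entry, for demonstration only.
    static byte[] demoArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("a.doc"));
            zout.write(new byte[] { 'P', 'K', 3, 4 });
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> entries = readEntries(new ByteArrayInputStream(demoArchive()));
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().length + " bytes");
        }
    }
}
```

Each byte array can then be wrapped in a new ByteArrayInputStream, which supports mark/reset, and handed to whatever format check and extractor you use, without disturbing the position of the surrounding ZipInputStream.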

What is the best way to tell whether a file is doc or docx? I want a solution that reads the content of a file that may be either, not one that merely reports which it is. ZipInputStream is clearly not a good fit for my case, and I do not think it is appropriate for others either. Why should I have to rely on an exception to find out whether the file is doc or docx?

Recommended answer

With Apache POI 3.17, the current stable version, you can use FileMagic. Internally, of course, this also has to look into the file.

Example:

import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;

import org.apache.poi.poifs.filesystem.FileMagic;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

 static String read(InputStream is) throws Exception {

  // FileMagic peeks at the header, so make sure the stream supports mark/reset
  is = FileMagic.prepareToCheckMagic(is);

  // inspect the header once and reuse the result
  FileMagic magic = FileMagic.valueOf(is);
  System.out.println(magic);

  String text = "";

  if (magic == FileMagic.OLE2) {
   // binary Word format (*.doc)
   WordExtractor ex = new WordExtractor(is);
   text = ex.getText();
   ex.close();
  } else if (magic == FileMagic.OOXML) {
   // Office Open XML format (*.docx)
   XWPFDocument doc = new XWPFDocument(is);
   XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
   text = extractor.getText();
   extractor.close();
  }

  return text;

 }

 public static void main(String[] args) throws Exception {

  InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); // a genuine binary OLE2 Word file
  System.out.println(read(is));
  is.close();

  is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); // an OOXML Word file named *.doc
  System.out.println(read(is));
  is.close();

  is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); // a genuine OOXML Word file
  System.out.println(read(is));
  is.close();

 }
}
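The example wraps every FileInputStream in a BufferedInputStream. A likely reason is that a header check has to read the first bytes and then rewind, so that the extractor still sees the stream from the beginning. The sketch below shows that peek-and-rewind pattern with plain java.io. It is an illustration under that assumption, not POI's actual implementation, and the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class HeaderPeek {

    // Returns up to `count` leading bytes without consuming them from the stream.
    static byte[] peek(InputStream in, int count) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(count);
        byte[] header = new byte[count];
        int total = 0;
        try {
            while (total < count) {
                int n = in.read(header, total, count - total);
                if (n == -1) break; // stream is shorter than `count`
                total += n;
            }
        } finally {
            in.reset(); // rewind so the next reader starts at the first byte
        }
        return Arrays.copyOf(header, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] { 'P', 'K', 3, 4, 20, 0 }));
        byte[] header = peek(in, 4);
        System.out.println("peeked " + header.length + " bytes, next read starts at " + in.read());
    }
}
```

A stream that does not support mark/reset, such as a bare FileInputStream or ZipInputStream, has to be wrapped in a BufferedInputStream (or buffered into a byte array) before a check like this can be applied.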
