How to read or parse MHTML (.mht) files in Java

Question

I need to mine the content of most well-known document file formats, such as:

  1. pdf
  2. html
  3. doc/docx, etc.

For most of these file formats I am planning to use Apache Tika:

http://tika.apache.org/

But as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ). There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I found none in Java.

I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.

According to the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the CodeProject page I mentioned earlier, mht files use GZip compression.

Attempting to decompress it in Java results in the following exceptions. With java.util.zip.GZIPInputStream:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

 java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)

Kindly suggest how to decompress it.

Thanks.

Recommended answer

Frankly, I wasn't expecting a solution in the near future and was about to give up, but somehow I stumbled on these pages:

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

They are not very catchy at first look, but if you read them carefully you will get the clue. After reading them I fired up IE and started saving random pages as *.mht files. Let me go through one line by line...

But let me explain beforehand that my ultimate goal was to separate/extract the HTML content and parse it. The solution is not complete in itself, as it depends on the character set or encoding I choose while saving. Even so, it will extract the individual files with minor hitches.

I hope this will be useful for anyone who is trying to parse/decompress *.mht/MHTML files :)

======== Explanation ======== ** Taken from a mht file **

From: "Saved by Windows Internet Explorer 7"

This is the software used to save the file.

Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0

Subject, date and MIME version: much like the mail format.

Content-Type: multipart/related;
    type="text/html";

This is the part which tells us that it is a multipart document. A multipart document has one or more different sets of data combined in a single body, and a multipart Content-Type field must appear in the entity's header. Here, we can also see the type as "text/html".

boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"

Of all these, this is the most important part. It is the unique delimiter that separates two different parts (HTML, images, CSS, script, etc.). Once you get hold of it, everything gets easy. Now I just have to iterate through the document, find the different sections, and save them according to their Content-Transfer-Encoding (base64, quoted-printable, etc.).
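To make the boundary idea concrete, here is a minimal sketch of pulling the boundary out of the header and cutting the document into raw parts. The class and method names (MhtBoundarySketch, findBoundary, splitParts) are made up for illustration, and it assumes the boundary is given in the quoted form shown above and that each delimiter line starts with "--" followed by the boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser shown later in this answer.
final class MhtBoundarySketch {
    // Matches boundary="..." (quoted form only, which is what IE writes).
    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the boundary token, or null if no quoted boundary parameter is found. */
    static String findBoundary(String document) {
        Matcher m = BOUNDARY.matcher(document);
        return m.find() ? m.group(1) : null;
    }

    /** Returns each part (its headers plus its still-encoded body) as one string. */
    static List<String> splitParts(String document, String boundary) {
        String delimiter = "--" + boundary;
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // null while we are outside any part
        for (String line : document.split("\r?\n", -1)) {
            if (line.startsWith(delimiter)) {
                if (current != null) {
                    parts.add(current.toString());
                }
                // The closing delimiter ends with an extra "--": nothing follows it.
                current = line.startsWith(delimiter + "--") ? null : new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String doc = "Content-Type: multipart/related;\n\tboundary=\"XYZ\"\n\n"
                + "--XYZ\nContent-Type: text/html\n\n<p>a</p>\n"
                + "--XYZ\nContent-Type: text/css\n\nb{}\n--XYZ--\n";
        String boundary = findBoundary(doc);
        System.out.println(boundary + " -> " + splitParts(doc, boundary).size() + " parts");
    }
}
```

Each returned part still carries its own headers, so the Content-Transfer-Encoding of a part can be read from the lines before its first empty line.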

Sample

 ------=_NextPart_000_0007_01CB22D1.93BBD1A0
 Content-Type: text/html;
 charset="utf-8"
 Content-Transfer-Encoding: quoted-printable
 Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =
.
.
.
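The sample part above is quoted-printable: a line ending in "=" is a soft line break, and "=XX" stands for the byte with hex value XX. A rough sketch of a decoder follows; the class name QuotedPrintableSketch is hypothetical, and it assumes the encoded text is plain ASCII, as quoted-printable output normally is.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating quoted-printable decoding.
final class QuotedPrintableSketch {

    /** Decodes quoted-printable text into raw bytes. */
    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '=') {
                out.write(c);          // literal character (assumed ASCII)
                i++;
            } else if (text.startsWith("\r\n", i + 1)) {
                i += 3;                // soft line break "=\r\n": drop it
            } else if (text.startsWith("\n", i + 1)) {
                i += 2;                // soft line break "=\n": drop it
            } else if (i + 2 < text.length()
                    && isHex(text.charAt(i + 1)) && isHex(text.charAt(i + 2))) {
                // "=XX": one byte written as two hex digits
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write('=');        // malformed escape: keep it unchanged
                i++;
            }
        }
        return out.toByteArray();
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        byte[] bytes = decode("caf=C3=A9 =3D ok=\nay");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The decoded bytes should then be turned into text with the charset named in that part's Content-Type header (utf-8 in the sample).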

** JAVA CODE **

An interface to define the constants.

public interface IConstants 
{
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    public String UTF8_BOM = "=EF=BB=BF";

    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}

The main parser class...

/**
 * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 */
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import sun.misc.BASE64Decoder;

/**
 * Parses a *.mht file and decomposes it into its constituent parts.
 * @author Manish Shukla 
 */

public class MHTParser implements IConstants
{
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * @throws Exception
     */
    public void decompress() throws Exception
    {
        BufferedReader reader = null;

        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        try
        {
            reader = new BufferedReader(new FileReader(mhtFile));

            final String boundary = getBoundary(reader);
            if(boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            int i = 1;
            while((line = reader.readLine()) != null)
            {
                String temp = line.trim();
                if(temp.contains(boundary)) 
                {
                    if(buffer != null) {
                        writeBufferContentToFile(buffer,encoding,filename,charset);
                        buffer = null;
                    }

                    buffer = new StringBuilder();
                }else if(temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                }else if(temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                }else if(temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                }else if(temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":")+1).trim();
                    i++;
                    filename = getFileName(location,type);
                }else {
                    if(buffer != null) {
                        buffer.append(line).append("\n");
                    }
                }
            }

        }finally 
        {
            if(null != reader)
                reader.close();
        }

    }

    private String getCharSet(String temp) 
    {
        String t = temp.split("=")[1].trim();
        return t.substring(1, t.length()-1);
    }

    /**
     * Save the file as per character set and encoding 
     */
    private void writeBufferContentToFile(StringBuilder buffer,String encoding, String filename, String charset) 
    throws Exception
    {

        if(!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null; 

        boolean text = true;

        if(encoding.equalsIgnoreCase("base64")){
            content = getBase64EncodedString(buffer);
            text = false;
        }else if(encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);         
        }
        else
            content = buffer.toString().getBytes();

        if(!text)
        {
            BufferedOutputStream bos = null;
            try
            {
                bos = new BufferedOutputStream(new FileOutputStream(filename));
                bos.write(content);
                bos.flush();
            }finally {
                bos.close();
            }
        }else 
        {
            BufferedWriter bw = null;
            try
            {
                bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filename), charset));
                bw.write(new String(content));
                bw.flush();
            }finally {
                bw.close();
            }
        }
    }

    /**
     * When the *.mht file is saved with 'utf-8' encoding, '=EF=BB=BF' (the quoted-printable form of the UTF-8 BOM) is added to the content.<br>
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer) 
    {
        //Set<String> uniqueHex = new HashSet<String>();
        //final Pattern p = Pattern.compile("(=\\p{XDigit}{2})*");

        String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");

        //Matcher m = p.matcher(temp);
        //while(m.find()) {
        //  uniqueHex.add(m.group());
        //}

        //System.out.println(uniqueHex);

        //for (String hex : uniqueHex) {
            //temp = temp.replaceAll(hex, getASCIIValue(hex.substring(1)));
        //}     

        return temp.getBytes();
    }

    /*private String getASCIIValue(String hex) {
        return ""+(char)Integer.parseInt(hex, 16);
    }*/
    /**
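     * Relies on the JDK-internal sun.misc.BASE64Decoder, so it is JDK dependent (the class was
     * removed in Java 9; java.util.Base64 is the supported replacement). It works well on JDKs that still ship it.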
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return new BASE64Decoder().decodeBuffer(buffer.toString());
    }

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type) 
    {
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if(type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if(location.endsWith("/")) {
            name = "main";
        }else
        {
            name = location.substring(location.lastIndexOf("/") + 1);

            Matcher m = p.matcher(name);
            String fname = "";
            while(m.find()) {
                fname = m.group();
            }

            if(fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0,fname.indexOf(".")), fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name,ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.<br>
     * If the file already exists, it appends a numerical value and continues.
     */
    private String getUniqueName(String name,String ext)
    {
        int i = 1;
        File file = new File(outputFolder,name + "." + ext);
        if(file.exists())
        {
            while(true)
            {
                file = new File(outputFolder, name + i + "." + ext);
                if(!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line){
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception 
    {
        String line = null;

        while((line = reader.readLine()) != null)
        {
            line = line.trim();
            if(line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}
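For newer JDKs, where sun.misc.BASE64Decoder is no longer available, the base64 step could probably be done with java.util.Base64 instead. The sketch below is only an illustration of that swap, not the code this answer was tested with; the class and method names (Base64PartSketch, decodeBase64Part) are hypothetical. The MIME decoder is used because a base64 part in an .mht file is wrapped across many lines, and that decoder skips line separators.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical replacement for the base64 step, using the public java.util.Base64 API (Java 8+).
final class Base64PartSketch {

    /** Decodes a line-wrapped base64 body; the MIME decoder ignores the line breaks. */
    static byte[] decodeBase64Part(CharSequence body) {
        return Base64.getMimeDecoder().decode(body.toString());
    }

    public static void main(String[] args) {
        // Two wrapped lines, as they would appear inside a base64-encoded part.
        byte[] bytes = decodeBase64Part("SGVsbG8s\r\nIE1IVE1M\n");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

For images and other binary parts, the returned bytes would be written to the output file unchanged.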

Regards,
