How to parse a line-based text file (.mht) the Scala way?
Problem description
I want to use Scala to parse a .mht file, but my code ends up looking exactly like Java.
Here is a sample of the mht file:
From: <Save by Tencent MsgMgr>
Subject: Tencent IM Message
MIME-Version: 1.0
Content-Type:multipart/related;
charset="utf-8"
type="text/html";
boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type: text/html
Content-Transfer-Encoding:7bit
<html xmlns="http://www.w3.org/1999/xhtml"><head></head>...</html>
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
There is a special line called the boundary, which acts as a separator line:
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
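Note that the separator line is the declared boundary value with two extra leading dashes. A minimal sketch of detecting both, assuming the boundary is always declared on a line of the form boundary="..." as in the sample (the names BoundaryDemo, extractBoundary, and isSeparator are made up for illustration):

```scala
object BoundaryDemo {
  // Matches a header line that declares the boundary, e.g. boundary="----=_NextPart_..."
  private val BoundaryParam = """.*boundary="([^"]*)".*""".r

  /** Returns the declared boundary value if this line contains one. */
  def extractBoundary(line: String): Option[String] = line match {
    case BoundaryParam(value) => Some(value)
    case _                    => None
  }

  /** A separator line is the declared boundary prefixed with two dashes. */
  def isSeparator(line: String, boundary: String): Boolean =
    line == "--" + boundary
}
```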
The first part is some information about the file itself and can be ignored. It is followed by 4 blocks: the first is an html file, and the others are jpg images stored as base64-encoded text.
If I use Java, the code looks like this:
BufferedReader reader = new BufferedReader(new FileReader("test.mht"));
String line = null;
String boundary = null;
// state for the current block
String contentType = null;
String encoding = null;
String location = null;
List<String> data = null;
while ((line = reader.readLine()) != null) {
    // first, find the boundary declared in the file header
    if (boundary == null) {
        if (line.trim().startsWith("boundary=\"")) {
            // substringBetween is a helper, presumably something like
            // StringUtils.substringBetween from Apache Commons Lang
            boundary = substringBetween(line, "\"", "\"");
        }
        continue;
    }
    if (line.equals("--" + boundary)) { // a new block starts
        if (contentType != null) {
            // save data to a file
        }
        encoding = null;
        contentType = null;
        location = null;
        data = new ArrayList<String>();
    } else {
        // still reading the header lines of the current block
        if (encoding == null || contentType == null || location == null) {
            if (line.trim().startsWith("Content-Type:")) { /* get content type */ }
            // else check encoding
            // else check location
        } else {
            data.add(line);
        }
    }
}
I tried to rewrite the code in Scala, but its structure turned out nearly the same; the only difference is that I used Scala syntax instead of Java.
Is there a Scala way to do the same work?
PS: I don't want to load the full file into memory, since the file is huge. Instead, I want to read and parse it line by line.
Thanks for your help!
Recommended answer
This could be a very simple use case for a state machine.
import scala.collection.mutable.ListBuffer
import scala.io.Source

case class Part(contentType: Option[String], encoding: Option[String], location: Option[String], data: ListBuffer[String])

var boundary: String = null
val Boundary = """.*boundary="(.*)"""".r

// parser states: 0 = before the first part, IN_PART = reading headers, IN_DATA = reading the body
var state = 0
val IN_PART = 1
val IN_DATA = 2

var _contentType: Option[String] = None
var _encoding: Option[String] = None
var _location: Option[String] = None
var _data = new ListBuffer[String]()

Source.fromFile("test.mht").getLines().foreach {
  case Boundary(b) => boundary = b
  // a separator line is the declared boundary prefixed with "--"
  case line if boundary != null && line == "--" + boundary =>
    _contentType = None
    _encoding = None
    _location = None
    _data = new ListBuffer[String]()
    state = IN_PART
  case "" => state match {
    case IN_PART => state = IN_DATA // a blank line ends the headers
    case IN_DATA =>
      val currentPart = Part(_contentType, _encoding, _location, _data)
      /* deal with currentPart here */
    case _ =>
  }
  case line => state match {
    case IN_DATA => _data.append(line)
    // split on the first colon only, since values may contain colons too
    case IN_PART => line.split(":", 2) match {
      case Array("Content-Type", t) => _contentType = Some(t.trim)
      case Array("Content-Transfer-Encoding", e) => _encoding = Some(e.trim)
      case Array("Content-Location", l) => _location = Some(l.trim)
      case _ =>
    }
    case _ => // ignore lines before the first part
  }
}
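If the mutable top-level variables bother you, the same state machine can be hidden behind an Iterator that yields one part at a time. This is only a sketch under some assumptions: each part's headers are separated from its body by a blank line (as in a well-formed MIME file), only one part's lines are held in memory at once, and the names MhtParts, parts, and printContentTypes are invented for this example.

```scala
import scala.io.Source
import scala.util.Using

object MhtParts {
  final case class Part(headers: Map[String, String], data: Vector[String])

  private val BoundaryParam = """.*boundary="([^"]*)".*""".r

  // Collects lines from the buffered iterator while the predicate holds,
  // leaving the first non-matching line unconsumed.
  private def takeLines(in: scala.collection.BufferedIterator[String])(p: String => Boolean): Vector[String] = {
    val builder = Vector.newBuilder[String]
    while (in.hasNext && p(in.head)) builder += in.next()
    builder.result()
  }

  /** Lazily yields the parts found in `lines`, one part per call to next(). */
  def parts(lines: Iterator[String]): Iterator[Part] = {
    val in = lines.buffered
    // Read the preamble until the boundary declaration is found.
    var separator: Option[String] = None
    while (separator.isEmpty && in.hasNext) {
      in.next() match {
        case BoundaryParam(b) => separator = Some("--" + b)
        case _                =>
      }
    }
    separator match {
      case None => Iterator.empty
      case Some(sep) =>
        // Also accept the closing form of the separator ("--boundary--").
        def isSep(l: String): Boolean = l == sep || l == sep + "--"
        // Skip whatever precedes the first separator line.
        while (in.hasNext && !isSep(in.head)) in.next()
        new Iterator[Part] {
          def hasNext: Boolean = {
            while (in.hasNext && isSep(in.head)) in.next()
            in.hasNext
          }
          def next(): Part = {
            if (!hasNext) throw new NoSuchElementException("no more parts")
            val headerLines = takeLines(in)(l => l.nonEmpty && !isSep(l))
            if (in.hasNext && in.head.isEmpty) in.next() // blank line after headers
            val body = takeLines(in)(l => !isSep(l))
            val headers = headerLines.flatMap { l =>
              l.split(":", 2) match {
                case Array(k, v) => Some(k.trim -> v.trim)
                case _           => None
              }
            }.toMap
            // Drop trailing blank lines that precede the next separator.
            Part(headers, body.reverse.dropWhile(_.isEmpty).reverse)
          }
        }
    }
  }

  /** Example use: stream a file and print each part's content type. */
  def printContentTypes(path: String): Unit =
    Using.resource(Source.fromFile(path, "UTF-8")) { src =>
      parts(src.getLines()).foreach(p => println(p.headers.get("Content-Type")))
    }
}
```

Because parts returns an Iterator, the caller can process and discard each part (for example, decode it and write it to disk) before the next one is read. Using requires Scala 2.13 or later.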