How to interpret raw Gmail data


Problem description

I am trying to find out how to analyze raw data of a Gmail email and save its elements (i.e. Body, Subject, Date and attachments). I found code samples for parsing multi-part data which I think can be used.

Raw Gmail Data - See this explanation

What I am looking for is a specific solution for Gmail raw data, which is a complex example of MIME with multiple types of elements embedded (images, HTML / rich text, attachments). My goal is to extract these elements and store them separately. I am not looking for an interactive protocol such as POP3 but a static approach, meaning that with this raw data alone, one can get the entire email along with its elements even when offline.

IDE - I am using Visual Studio Ultimate, C++ along with Win32 API.

What I have tried:

For example, this article seems to have the building blocks for parsing such emails. However, I am looking for a solution dedicated to such raw data, as this type of data is quite complex, combining various elements and attachments, all in one file (or block of data).


Here is my current code.

LPCSTR szMailId, LPCSTR szMailBody;
MIMELIB::CONTENT c;
while ((*szMailBody == ' ') || (*szMailBody == '\r') || (*szMailBody == '\n'))
{
    szMailBody++;
}
char deli[] = "<pre class=\"raw_message_text\" id=\"raw_message_text\">";
szMailBody = strstr(szMailBody, deli);
if (szMailBody == NULL)
    return; // Marker not found: strstr() returned NULL, nothing to parse
szMailBody += strlen(deli);

CStringA Body = szMailBody;
Body = Body.Left(Body.Find("<//pre><//div><//div><//div><//body><//html>"));
Body = Body.Mid(Body.Find("<html>"));

szMailBody = Body.GetString();
if (c.Parse(szMailBody) != MIMELIB::MIMEERR::OK)
    return;
// Get some headers
auto senderHdr = c.hval("From");
string strDate = c.hval("Date");    // Example Sat, 13 Jan 2018 07:54:39 -0500 (EST)
auto subjectHdr = c.hval("Subject");

auto a1 = c.hval("Content-Type", "boundary");
// Not a multi-part mail if empty
// Then use c.Decode() to get and decode the single part body
if (a1.empty())
    return;
vector<MIMELIB::CONTENT> Contents;
MIMELIB::ParseMultipleContent2(szMailBody, strlen(szMailBody), a1.c_str(), Contents);

int i;
for (i = 0; i < Contents.size(); i++)
{
    vector<char> d;
    string type = Contents[i].hval("Content-type");
    d = Contents[i].GetData(); // Decodes from Base64 or Quoted-Printable
}

Recommended answer


Your thread under the mentioned article seems to have solved your problem already, but so that there is also an answer here:

There is no Gmail-specific "raw" data format. It is the standard format of mail messages as defined by RFC 2822: Internet Message Format and related RFCs like RFC 2045 - 2049 for the MIME extensions.

Those RFCs contain the necessary information to write a parser.
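
As one illustration of what those RFCs specify: RFC 2046 delimits the parts of a multipart body with lines of the form "--boundary" and ends the body with "--boundary--". The following is only a minimal sketch of that splitting step, not the way mimelib.h implements it. The function name SplitMultipart is hypothetical, and for brevity the sketch does not check that a delimiter starts at the beginning of a line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of mimelib.h): splits a multipart body into
// its raw parts. Each returned part still contains its own headers.
std::vector<std::string> SplitMultipart(const std::string& body,
                                        const std::string& boundary)
{
    std::vector<std::string> parts;
    const std::string delim = "--" + boundary;
    std::size_t pos = body.find(delim);
    while (pos != std::string::npos)
    {
        std::size_t start = pos + delim.size();
        // "--boundary--" marks the end of the multipart body
        if (body.compare(start, 2, "--") == 0)
            break;
        // Skip the rest of the delimiter line
        const std::size_t eol = body.find('\n', start);
        if (eol == std::string::npos)
            break;
        start = eol + 1;
        const std::size_t next = body.find(delim, start);
        if (next == std::string::npos)
            break; // Truncated message: no further delimiter
        // The line break in front of a delimiter belongs to the delimiter
        std::size_t end = next;
        if (end > start && body[end - 1] == '\n')
            --end;
        if (end > start && body[end - 1] == '\r')
            --end;
        parts.push_back(body.substr(start, end - start));
        pos = next;
    }
    return parts;
}
```

Each returned part is itself a small message with headers and a body, so nested multipart content can be handled by applying the same step again to a part whose Content-Type carries its own boundary.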


Example code using the mimelib.h file from the mentioned article. Compiled and tested with VS 2017. Requires /Zc:strictStrings-.

#include "stdafx.h"

#include <windows.h>
#include <WinInet.h>
#include <string>
#include <sstream>
#include <vector>
#include <memory>
#include <intrin.h>
#include <stdio.h>      // fopen_s, fread, fwprintf
#include <sys/stat.h>   // _stat

using namespace std;

#include "mimelib.h"

#pragma comment(lib, "crypt32")

MIMELIB::MIMEERR ParsePart(MIMELIB::CONTENT& c, const char* szPart = "")
{
    MIMELIB::MIMEERR merr = MIMELIB::MIMEERR::OK;
    auto boundary = c.hval("Content-Type", "boundary");
    // Single part
    if (boundary.empty())
    {
        std::string strPart = (szPart && *szPart) ? szPart : "1";
        auto typeHdr = c.hval("Content-Type");
        if (typeHdr.empty())
        {
            wprintf(L"Part %hs: Default (single)\n", strPart.c_str());
            typeHdr = "text/plain;";
        }
        else
        {
            wprintf(L"Part %hs: %hs\n", strPart.c_str(), typeHdr.c_str());
        }
        auto fileName = c.hval("Content-Disposition", "filename");
        if (fileName.empty())
        {
            // Create a file name from part and an extension that matches the content type
            std::string ext = "txt";
            auto subTypeS = typeHdr.find('/');
            auto subTypeE = typeHdr.find(';');
            if (subTypeS > 0 && subTypeE > subTypeS)
            {
                subTypeS++;
                ext = typeHdr.substr(subTypeS, subTypeE - subTypeS);
            }
            if (ext == "plain")
                ext = "txt";
            else if (ext == "octet-stream")
                ext = "bin";
            fileName = "Part";
            fileName += strPart;
            fileName += '.';
            fileName += ext;
        }
        // Get the decoded body of the part
        vector<char> partData;
        c.DecodeData(partData);
        // TODO: Decode fileName if it is inline encoded
        FILE *f;
        errno_t err = fopen_s(&f, fileName.c_str(), "wb");
        if (err)
        {
            char errBuf[128];
            strerror_s(errBuf, err);
            fwprintf(stderr, L" Failed to create file %hs: %hs\n", fileName.c_str(), errBuf);
        }
        else
        {
            fwrite(partData.data(), partData.size(), 1, f);
            fclose(f);
            wprintf(L" Saved part to file %hs\n", fileName.c_str());
        }
    }
    else
    {
        // Decoded part of mail (full mail with top level call)
        auto data = c.GetData();
        // Split it into the boundary separated parts 
        vector<MIMELIB::CONTENT> Contents;
        merr = MIMELIB::ParseMultipleContent2(data.data(), data.size(), boundary.c_str(), Contents);
        if (MIMELIB::MIMEERR::OK == merr)
        {
            int part = 1;
            for (auto & cp : Contents)
            {
                std::string strPart;
                if (szPart && *szPart)
                {
                    strPart = szPart;
                    strPart += '.';
                }
                char partBuf[16];
                _itoa_s(part, partBuf, 10);
                strPart += partBuf;
                ParsePart(cp, strPart.c_str());
                ++part;
            }
        }
    }
    return merr;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        fwprintf(stderr, L"Usage: ParseMail <file>\n");
        return 1;
    }
    struct _stat st;
    if (_stat(argv[1], &st))
    {
        fwprintf(stderr, L"File %hs not found\n", argv[1]);
        return 1;
    }
    FILE *f = NULL;
    errno_t err = fopen_s(&f, argv[1], "rb");
    if (err)
    {
        char errBuf[128];
        strerror_s(errBuf, err);
        fwprintf(stderr, L"File %hs can't be opened: %hs\n", argv[1], errBuf);
        return 1;
    }
    char *buf = new char[st.st_size + 1];
    fread(buf, 1, st.st_size, f);
    buf[st.st_size] = 0;
    fclose(f);

    MIMELIB::CONTENT c;
    MIMELIB::MIMEERR merr = c.Parse(buf);
    if (merr != MIMELIB::MIMEERR::OK)
    {
        fwprintf(stderr, L"Error parsing mail file %hs\n", argv[1]);
    }
    else
    {
        auto senderHdr = c.hval("From");
        auto dateHdr = c.hval("Date");
        auto subjectHdr = c.hval("Subject");
        wprintf(L"From: %hs\n", senderHdr.c_str());
        wprintf(L"Date: %hs\n", dateHdr.c_str());
        wprintf(L"Subject: %hs\n\n", subjectHdr.c_str());
        merr = ParsePart(c);
    }
    delete[] buf;
    return merr;
}





Example output for a multipart mail:

From: [redacted]
Date: Tue, 26 Sep 2017 09:44:15 +0200
Subject: =?ISO-8859-1?Q?WG=3A_Haftverzichtserkl=E4rung_f=FCr_[...]_Fa=2E_S
iS?=; =?ISO-8859-1?Q?_-_EMB_168_-_12=2E10=2E2017?=

Part 1.1: text/plain; charset="UTF-8"
 Saved part to file Part1.1.txt
Part 1.2: text/html; charset="UTF-8"
 Saved part to file Part1.2.html
Part 2: application/octet-stream; name="HaVerzSiS.pdf"
 Saved part to file HaVerzSiS.pdf
Part 3: image/jpeg; name="Liegeplatz FS EMB.jpg"
 Saved part to file Liegeplatz FS EMB.jpg
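
The Subject line in this output is printed undecoded: it consists of RFC 2047 "encoded-words" of the form =?charset?encoding?text?=. The same applies to the file name mentioned in the TODO comment of the code. The following is a minimal sketch of decoding one such word, limited to the ISO-8859-1 charset and the Q encoding seen above, with the result converted to UTF-8. The names HexVal and DecodeLatin1QWord are hypothetical and not part of mimelib.h. Other charsets, the B (Base64) encoding and the joining of several adjacent words are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if it is not a hex digit.
static int HexVal(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    return -1;
}

// Hypothetical helper: decodes one "=?ISO-8859-1?Q?...?=" encoded-word to UTF-8.
// Any other input is returned unchanged.
std::string DecodeLatin1QWord(const std::string& word)
{
    const std::string prefix = "=?ISO-8859-1?Q?";
    if (word.size() < prefix.size() + 2 ||
        word.compare(word.size() - 2, 2, "?=") != 0)
        return word;
    // Charset and encoding names are case-insensitive
    for (std::size_t i = 0; i < prefix.size(); ++i)
    {
        if (std::toupper(static_cast<unsigned char>(word[i])) != prefix[i])
            return word;
    }
    std::string out;
    const std::size_t end = word.size() - 2; // index of the closing "?="
    for (std::size_t i = prefix.size(); i < end; ++i)
    {
        unsigned char ch = static_cast<unsigned char>(word[i]);
        if (ch == '_')
        {
            ch = ' '; // Q encoding writes a space as underscore
        }
        else if (ch == '=' && i + 2 < end)
        {
            const int hi = HexVal(word[i + 1]);
            const int lo = HexVal(word[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                ch = static_cast<unsigned char>(hi * 16 + lo);
                i += 2;
            }
        }
        // ISO-8859-1 code points equal Unicode code points U+0000..U+00FF
        if (ch < 0x80)
        {
            out += static_cast<char>(ch);
        }
        else
        {
            out += static_cast<char>(0xC0 | (ch >> 6));
            out += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return out;
}
```

A complete decoder would first unfold the header, then locate each encoded-word in it and drop the whitespace between adjacent encoded-words, as RFC 2047 requires.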




Here is a copy of a typical Gmail message (interestingly from you):
Delivered-To: anonymousnam AT gmail.com
Received: by 10.2.15.201 with SMTP id 70csp1706370jao;
        Fri, 12 Jan 2018 02:08:02 -0800 (PST)
X-Google-Smtp-Source: ACJfBovDfSZaL48gp2hiXRdWrkQ2fN4ADImypAfgO6nn3bL9YXe9pyOS1NCsj6nejU8n0AFGoP8W
X-Received: by 10.107.140.78 with SMTP id o75mr23661888iod.219.1515751682675;
        Fri, 12 Jan 2018 02:08:02 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1515751682; cv=none;
        d=google.com; s=arc-20160816;
        b=Wronqhb0qgSWSapehG2PSI00FvI+Y2o3/MG8O6czKU/v9/9ETX4ObQ6hBP4fzRoAko
         U8mhgMZqhMoIKVv7czqG2g0S/VxBBkmPNUv7JbLZdZISzsO9e46SdfbhSKJMEdrESxtW
         7tankKOzdVFh5kbOX0ZWJrrCO1/a15lEo5MBChjr0apydxskoXq7p2vNmafiC9pqKads
         jNK2+Pkc0Y2OeEfL67Vs8IlXN+u1y2TYn9A8uZbfdNmPL6zq3rJ31v9hDdyXM3p3oTg4
         YP0K1+el4JFRgF4zo0Pyg4gFl62QzIv/SfP12o6ihsOIJZ68eS7PoDHYfZvKQXLySAjH
         2l2w==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=content-transfer-encoding:subject:date:to:from:mime-version
         :message-id:arc-authentication-results;
        bh=8sfAQGYMvxQ7PtXZ5Em7odijeBxnxUxWk1qh7LONjxM=;
        b=xHGbZSMYY74D6WFzT2SVpmOjKqALpbjgEopoaeGKE+2mUj77Is+gvHb/Q81aFnvjDY
         0xPKbsKUM6vPYCO9FU9QFumqP/XrYxEVQ5EOzxFk0SGV18QuzLGTkIxTVz97ARYLpnII
         M/gvPlSNX8hUfyzjh/0NsiD/64FMoCmLanRA0aQb+73TUHcKKwIEMhbwgQ9xizvShBKz
         hCMWDz92D6qhTI5Dhhfzjy/xm4j4TTNQxAW5rOgdm5LFg22tgTdkhenGSWUMtgOalstY
         0r4he7xoor6Ut0rT3QBZKr+5kywuyi3XZGVUmXvwKtss3ChDGcL4xqwos4HUzsW6i8YP
         1GFw==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of anonymousnam AT codeproject.com designates 76.74.234.221 as permitted sender) smtp.mailfrom=anonymousnam AT codeproject.com
Return-Path: <anonymousnam AT odeproject.com>
Received: from mail.notifications.codeproject.com (mail.notifications.codeproject.com. [76.74.234.221])
        by mx.google.com with ESMTPS id l64si3747147iof.279.2018.01.12.02.08.02
        for <anonymousname AT gmail.com>
        (version=TLS1 cipher=AES128-SHA bits=128/128);
        Fri, 12 Jan 2018 02:08:02 -0800 (PST)
Received-SPF: pass (google.com: domain of anonymousname AT codeproject.com designates 76.74.234.221 as permitted sender) client-ip=76.74.234.221;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of anonymousname AT codeproject.com designates 76.74.234.221 as permitted sender) smtp.mailfrom=anonymousname AT codeproject.com
Message-Id: <5a588902.43bb6b0a.f549b.f8cdSMTPIN_ADDED_MISSING@mx.google.com>
Received: from CP-WEB2 (cp-web2.codeproject.com [192.168.5.52]) by mail.notifications.codeproject.com (Postfix) with ESMTP id 2FB2E1E0DE8 for <anonymousname AT gmail.com>; Fri, 12 Jan 2018 04:55:19 -0500 (EST)
MIME-Version: 1.0
From: CodeProject Answers <anonymousname AT codeproject.com>
To: anonymousname AT gmail.com>
Date: 12 Jan 2018 05:08:01 -0500
Subject: CodeProject | A reply was posted to your comment
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.=
w3.org/TR/html4/loose.dtd"><html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
<meta name=3D"viewport" content=3D"width=3Ddevice-width">
</head><body style=3D"background-color: white;font-size: 14px; font-family:=
 'Segoe UI', Arial, Helvetica, sans-serif">
<style type=3D"text/css">
body,table,p,div { background-color:white; }
body,table,p,div { background-color:white; }
body, td, p,h1,h2,h6,h3,h4,li,blockquote,div{ font-size:14px; font-family: =
'Segoe UI', Arial, Helvetica, sans-serif; } =20
h1 {font-size: 26px; font-weight: bold; color: #498F00; margin-bottom:5px;m=
argin-top:0px;} =20
h2 { font-size: 24px; font-weight: 500; }
h4 { font-size: 16px; }
h3 {font-size: 11pt; font-weight:bold;} =20
h6 {font-size:6pt;color:#666;margin:0;} =20
table =09=09=09{ width: 100%;} =20
table.themed =09{ background-color:#FAFAFA; } =20
a =09=09=09=09{ text-decoration:none;} =20
a:hover =09=09{ text-decoration:underline;} =20
.tiny-text=09=09{ font-size: 12px; }
.desc =09=09=09{ color:#333333; font-size:12px;}
.themed td  =09{ padding:2px; } =20
.themed .alt-item { background-color:#FEF9E7; } =20
.header =09=09{ font-weight:bold; background-color:#FF9900; vertical-align:=
middle;} =20
.footer =09=09{ font-weight:bold; background-color: #488E00; color:White; v=
ertical-align:middle; }
.signature =09=09{ border-top: solid 1px #CCCCCC; padding-top:0px; margin-t=
op:10px; max-height:150px; overflow:auto;}

.content-list=09=09{ margin-bottom: 17px;}
.content-list-item=09{ margin:     10px 0; }
.doctype img=09=09{ vertical-align:bottom; padding-right:3px;}
.entry=09=09=09    { font-size: 14px; line-height:20px; margin: 0;}
.title=09=09=09    { font-size: 16px; font-weight:500; padding:0; }
.entry=09=09=09=09{ font-size: 14px; color:#666; }
.author, .author a  { font-size: 11px; font-weight:bold; }
.location=09=09    { font-size: 11px; font-weight:bold; color: #999}
.summary            { font-size: 12px; color: #999; padding: 0px 0 10px; }
.theme-fore         { color: #f90; }
.theme-back         { background-color: #f90; }
</style>
<table cellspacing=3D"1" cellpadding=3D"3" class=3D"header" border=3D"0" st=
yle=3D"background-color: #FF9900;width: 100%;font-weight: bold;vertical-ali=
gn: middle"><tbody><tr><td style=3D"font-size: 14px; font-family: 'Segoe UI=
', Arial, Helvetica, sans-serif">
<img border=3D"0" src=3D"https://www.codeproject.com/App_Themes/CodeProject=
/Img/logo225x40.gif" width=3D"225" height=3D"40"></td></tr></tbody></table>

<p style=3D"background-color: white;font-size: 14px; font-family: 'Segoe UI=
', Arial, Helvetica, sans-serif">Michael Haephrati has pos=
ted a reply to your comment about=20
"<a href=3D"https://www.codeproject.com/Answers/1224946/Conversion-to-Unico=
de-Cplusplus-Microsoft-UTF-Nati?cmt=3D969334#cmt969334" style=3D"text-decor=
ation: none">Conversion to Unicode (C++, Microsoft, UTF-16, Native Windows)=
</a>":</p>=20

<blockquote style=3D"font-size: 14px; font-family: 'Segoe UI', Arial, Helve=
tica, sans-serif">Apparently there is no expiration date to questions and w=
hen I looked for unanswered questions, I got here...</blockquote>

<hr class=3D"divider" noshade=3D"noshade" size=3D"1">
<div style=3D"background-color: white;font-size: 14px; font-family: 'Segoe =
UI', Arial, Helvetica, sans-serif"><a href=3D"https://www.codeproject.com" =
style=3D"text-decoration: none">CodeProject</a></div>
<div class=3D"small" style=3D"background-color: white;font-size: 14px; font=
-family: 'Segoe UI', Arial, Helvetica, sans-serif">Note: T=
his message has been sent from an unattended email box.</div>
</body></html>





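So you can see that each message header is a field name followed by a colon and the field value; a long value may continue on following lines that start with whitespace. The message body starts after the first blank line and may be rich text, HTML or plain text, as declared by the Content-Type header. The mail RFCs define the standard header field names; the X- headers in the sample are non-standard extensions.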
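
That structure can be sketched in a few lines of C++. This is a minimal sketch under simplifying assumptions, not a full RFC 2822 parser: the type ParsedMail and the functions SplitHeadersAndBody, HexVal and DecodeQuotedPrintable are hypothetical names, header names are lower-cased, and for a repeated header such as Received only the last occurrence is kept. The second function undoes the quoted-printable encoding used by the body of the sample message.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical result type for this sketch
struct ParsedMail
{
    std::map<std::string, std::string> headers; // lower-cased name -> last value
    std::string body;                           // still transfer-encoded
};

ParsedMail SplitHeadersAndBody(const std::string& raw)
{
    ParsedMail mail;
    std::string name; // header that a folded line would continue
    std::size_t pos = 0;
    while (pos < raw.size())
    {
        std::size_t eol = raw.find('\n', pos);
        if (eol == std::string::npos)
            eol = raw.size();
        std::string line = raw.substr(pos, eol - pos);
        pos = (eol < raw.size()) ? eol + 1 : eol;
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        if (line.empty())
        {
            mail.body = raw.substr(pos); // first blank line ends the headers
            break;
        }
        const std::size_t text = line.find_first_not_of(" \t");
        if (text != 0 && !name.empty())
        {
            // Line starts with whitespace: continuation of the previous header
            if (text != std::string::npos)
                mail.headers[name] += " " + line.substr(text);
            continue;
        }
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
        {
            name.clear(); // not a header line, ignore it
            continue;
        }
        name = line.substr(0, colon);
        for (char& ch : name)
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        const std::size_t value = line.find_first_not_of(" \t", colon + 1);
        mail.headers[name] =
            (value == std::string::npos) ? std::string() : line.substr(value);
    }
    return mail;
}

// Value of one hex digit, or -1 if it is not a hex digit.
static int HexVal(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    return -1;
}

// Decodes a quoted-printable body: "=XX" is a byte, "=" at a line end is a soft break.
std::string DecodeQuotedPrintable(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '=') { out += in[i]; continue; }
        if (i + 1 < in.size() && in[i + 1] == '\n') { i += 1; continue; }
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n') { i += 2; continue; }
        const int hi = (i + 2 < in.size()) ? HexVal(in[i + 1]) : -1;
        const int lo = (i + 2 < in.size()) ? HexVal(in[i + 2]) : -1;
        if (hi >= 0 && lo >= 0) { out += static_cast<char>(hi * 16 + lo); i += 2; }
        else out += '='; // malformed sequence: keep it literally
    }
    return out;
}
```

For the sample message, the value stored under "content-transfer-encoding" would be "quoted-printable", which tells the caller to pass the body through DecodeQuotedPrintable before saving it as HTML.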

