Parse email content from quoted reply

Question

I'm trying to figure out how to separate the text of an email from any quoted reply text it might include. I've noticed that email clients usually insert an "On such and such date so and so wrote" line or prefix the quoted lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea how to programmatically detect reply text? I am using C# to write this parser.

Answer

I did a lot more searching on this and here's what I've found. There are basically two situations in which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC 822, RFC 2822, this interesting article on threading, or this article on threading. Once you have reassembled the thread, you can remove the external text (such as the To, From, and CC lines) and you're done.
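The header-based approach can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, assuming each message is available as raw RFC 2822 text; `build_parent_map` and `thread_root` are hypothetical helper names, and only Message-ID and In-Reply-To are used (Thread-Index is not handled here).

```python
from email import message_from_string


def build_parent_map(raw_messages):
    """Map each Message-ID to the Message-ID it replies to (or None)."""
    parents = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue  # a message without an ID cannot be placed in a thread
        in_reply_to = (msg.get("In-Reply-To") or "").strip()
        parents[msg_id] = in_reply_to or None
    return parents


def thread_root(msg_id, parents):
    """Follow In-Reply-To links upward until a message with no known parent."""
    seen = set()
    while parents.get(msg_id) and msg_id not in seen:
        seen.add(msg_id)  # guards against malformed, cyclic reply chains
        msg_id = parents[msg_id]
    return msg_id
```

Once each message is linked to its parent, the parent's body is known, so text in the reply that also appears in the parent can be treated as quoted.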

Two, if the messages you are working with do not have the headers, you are stuck with similarity matching to determine which parts of an email repeat text from an earlier message. In this case you might want to look into a Levenshtein distance algorithm, such as this one on Code Project or this one.
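As a rough sketch of the similarity approach, the Python snippet below computes the Levenshtein distance with the standard dynamic-programming recurrence and drops reply lines that closely match a line of an earlier message. The helper names and the `max_ratio` threshold of 0.2 are assumptions chosen for illustration, not values from the linked articles.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(line):
    """Drop leading quote markers and whitespace, and ignore case."""
    return line.lstrip("> \t").strip().lower()


def strip_repeated_lines(reply, parent, max_ratio=0.2):
    """Keep only the reply lines that do not closely match a parent line."""
    parent_lines = [normalize(l) for l in parent.splitlines() if normalize(l)]
    kept = []
    for line in reply.splitlines():
        norm = normalize(line)
        is_quote = bool(norm) and any(
            levenshtein(norm, p) <= max_ratio * max(len(norm), len(p))
            for p in parent_lines
        )
        if not is_quote:
            kept.append(line)
    return "\n".join(kept).strip()
```

Comparing every reply line against every parent line is quadratic, which is fine for single emails but may need a cheaper pre-filter on large archives.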

Either way, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you are stuck with only one message from the thread, you're going to have to guess what the quote is. In that case, here are the different quotation methods I have seen:

  1. A line (as seen in Outlook).
  2. Angle brackets.
  3. "---Original Message---"
  4. "On such-and-such day, so-and-so wrote:"

Remove the text from that marker down and you're done. The downside to all of these is that they assume the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
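A minimal sketch of this "cut at the first marker" heuristic, written in Python for brevity (the same patterns should port to C# regular expressions). The exact patterns are assumptions: in particular, the Outlook "line" is guessed to be a row of underscores, and real clients vary in wording and language.

```python
import re

# One pattern per quotation style listed above (all are illustrative guesses).
QUOTE_MARKERS = [
    re.compile(r"^_{5,}\s*$"),                               # separator line
    re.compile(r"^>"),                                       # angle-bracket prefix
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE),
    re.compile(r"^On\b.*\bwrote:\s*$", re.IGNORECASE),       # attribution line
]


def strip_quoted_reply(body):
    """Return the text above the first line that looks like a quote marker."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if any(p.match(stripped) for p in QUOTE_MARKERS):
            break  # everything from here down is assumed to be quoted text
        kept.append(line)
    return "\n".join(kept).rstrip()
```

Because it stops at the first match, this sketch discards any reply text interleaved below a quoted line, and an attribution that a client wraps across two lines would not be caught by the last pattern.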
