Email parsing and processing architecture

Question

OK, I'm doing heavy processing on each email. Let's say I'm building an AI for a system that will auto-reply to the emails it receives, but I still don't know where to start.

What I'm thinking

Architecture 1

Questions:

  1. Let's say we have 1000 emails/sec. How exactly does a mail server (Exim, Sendmail, Dovecot, etc.) work?

Can parseandsavetomysql.py process 1000 emails per second through piping? How does that work, too? By the way, it currently works fine, but I need to understand this.

Is my logic about workers correct? Or should I use a queuing system? I have looked at Resque and friends, but I still don't get how we can lock a session. In this problem, say, a worker needs to announce "hey, I'm processing this file; don't work on email1.rawemail, work on another one". How can we do that correctly, or is there a simpler way?

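Architecture 2

Questions:

  1. Same as above.
  2. How can a POP/SMTP server receive 1000 emails per second?
  3. Can we receive email via both IMAP and POP? Since we are only processing, is POP3 the right choice for performance? PHP has an imap_open function that I'm currently using.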

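Add-ons

  1. Is there a good link or blog post that tackles the same problem as mine?
  2. Can you point me to projects, apps, or third-party services that solve my problem?
  3. If you have any ideas, please write them down.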

Thanks for helping out, Adam Ramadhan

Edited to add my current architecture.

Recommended answer

Like a lot of "big picture" architecture questions, the best answer is really "it depends". Can you control the deployment environment? That is, can you use whatever e-mail server you'd like, or are you constrained to one that's already installed and hosted? Can you run code on the same machine as the SMTP service? These questions, and a lot of others, should be considered to come up with a (near-)optimal architecture.

Given that, I'm going to make a couple of assumptions and offer some ideas that I think are worth exploring...

You should look into a high-performance messaging system. Specifically, take a look at RabbitMQ. RabbitMQ is reliable and efficient, and distributing workload based on asynchronous incoming events is a pattern that its (in my opinion, very good) tutorials specifically discuss.

With a messaging server like this, you have one process that receives the incoming e-mail. Preferably this is done as part of the SMTP process, or at least very close to it - especially with the workload that you've mentioned. If you have no other choice, then your idea of using cron to gather messages via POP or IMAP will have to do for now.
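As a rough illustration of the "very close to the SMTP process" option, here is a minimal sketch of a script that a mail server's pipe transport could run for each delivered message. It uses only Python's standard `email` package. The names `parse_raw_email` and `handle_piped_message`, and the set of fields returned, are assumptions made for this example and are not taken from the asker's parseandsavetomysql.py.

```python
import sys
from email import message_from_bytes, policy


def _header(msg, name: str) -> str:
    """Return a header as text, or an empty string when it is absent."""
    value = msg[name]
    return str(value) if value is not None else ""


def parse_raw_email(raw: bytes) -> dict:
    """Parse one raw message into the fields a processor might store."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": _header(msg, "From"),
        "to": _header(msg, "To"),
        "subject": _header(msg, "Subject"),
        "message_id": _header(msg, "Message-ID"),
        "body": body_part.get_content() if body_part is not None else "",
    }


def handle_piped_message(stream) -> dict:
    """Read exactly one message from a binary stream and parse it."""
    return parse_raw_email(stream.read())
```

When the mail server pipes a message to the script, the entry point would call `handle_piped_message(sys.stdin.buffer)` and then store or enqueue the result. With this design, each message typically costs one process start, which is likely to matter more than the parsing itself at the rates mentioned in the question.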

The e-mail gathering process would then push messages into the RabbitMQ queue. (Perhaps not literally the e-mails themselves, although that is a possibility; I was thinking more of references to where each e-mail is efficiently stored.) You then run multiple worker processes that are subscribed to a named message queue. RabbitMQ (or whatever messaging service you decide upon) would then distribute those messages in a round-robin fashion to the individual subscribers. If a worker process is already loaded, it can NACK the message, or send its own control-flow message back to the service.

With a VERY high workload (again, like the one you've proposed), I'd highly recommend some kind of management process that keeps tabs on the overall health of the distributed system. The manager would gather run-time statistics (VERY useful for future growth planning, optimization, and refactoring of the overall system), and be able to spin worker processes up and shut them down. Before you get to that very high workload, and assuming that your worker processes are stable and can live a long time without memory fragmentation, etc., just using the message server to distribute work should suffice.
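The flow described above might be sketched as follows, assuming the third-party `pika` client for RabbitMQ and a broker on localhost. The queue name `incoming_email`, the `process` callable, and the `is_overloaded` check are all hypothetical names chosen for this example. It also bears on the "don't work on email1.rawemail" concern from the question: while a job is delivered to one worker and not yet acknowledged, the broker does not hand it to another worker.

```python
import json

QUEUE_NAME = "incoming_email"  # hypothetical queue name


def encode_job(path: str) -> bytes:
    """Serialize a reference to a stored raw e-mail, not the e-mail itself."""
    return json.dumps({"path": path}).encode("utf-8")


def publish_job(channel, path: str) -> None:
    """Called by the receiving process once the raw message is stored."""
    import pika  # third-party client, imported lazily so the helpers work without it

    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE_NAME,
        body=encode_job(path),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )


def make_worker_callback(process, is_overloaded=lambda: False):
    """Build a consumer callback around a hypothetical process(path) function."""

    def on_message(channel, method, properties, body):
        if is_overloaded():
            # Return the job to the queue so it can be redelivered.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
            return
        process(json.loads(body)["path"])
        # Acknowledge only after the work is done; if the worker dies
        # first, the broker redelivers the unacknowledged job.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message


def run_worker(process, host: str = "localhost") -> None:
    """Consume jobs until interrupted; start one of these per worker process."""
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    channel.basic_qos(prefetch_count=1)  # at most one unacknowledged job per worker
    channel.basic_consume(
        queue=QUEUE_NAME,
        on_message_callback=make_worker_callback(process),
    )
    channel.start_consuming()
```

The receiving process would call `publish_job(channel, "/path/to/email1.rawemail")` after writing the message to storage, and each worker would call `run_worker(my_processing_function)`. Sending only a path keeps the queue payload small, at the cost of requiring storage that every worker can read.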

For what it's worth, I've had some experience writing e-mail processors (specifically with xmail, which I'd recommend if you're just starting out on your project and have a lot of control over its early stages). Also, I'm currently using RabbitMQ to build a multi-agent result caching system for a major scientific computing grid.

Anyway...good luck with your project!
