erlang processes and message passing architecture


Question


The task I have at hand is to read the lines of a large file, process them, and return ordered results.


My algorithm is:



  1. start with master process that will evaluate the workload (written in the first line of the file)
  2. spawn worker processes: each worker will read part of the file using pread/3, process this part, and send results to master
  3. master receives all sub-results, sorts them, and returns; so basically no communication is needed between workers.

My questions:



  1. How do I find the optimal balance between the number of Erlang processes and the number of cores? If I spawn one process for each processor core I have, would that be under-utilizing my CPU?
  2. How does pread/3 reach the specified line; does it iterate over all lines in the file? And is pread/3 a good plan for parallel file reading?
  3. Is it better to send one big message from process A to B, or N small messages? I have found part of the answer in the link below, but I would appreciate further elaboration:
    erlang message passing architecture


Answer


  1. Erlang processes are cheap. You're free (and encouraged) to use more of them than you have cores. There might be an upper limit to what is practical for your problem (loading 1 TB of data with one process per line is asking a bit much, depending on line size).


The easiest way to do it when you don't know is to let the user decide. This means you could decide to spawn N workers, distribute the work between them, and wait to hear back. Re-run the program while changing N if you don't like how it runs.
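A minimal sketch of that pattern, assuming the work arrives as a list; chunk/2, split_every/2 and process_chunk/1 are made-up names, and process_chunk/1 just sorts its chunk as stand-in work:

    -module(pool_sketch).
    -export([run/2]).

    %% Spawn one worker per chunk of Work and wait for all replies.
    run(N, Work) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, process_chunk(C)} end),
                    Ref
                end || C <- chunk(N, Work)],
        %% Selective receive: one reply per unique reference, in spawn order.
        [receive {Ref, Result} -> Result end || Ref <- Refs].

    %% Split a list into at most N roughly equal chunks.
    chunk(N, L) when N > 0 ->
        Size = max(1, (length(L) + N - 1) div N),
        split_every(Size, L).

    split_every(_, []) -> [];
    split_every(Size, L) when length(L) =< Size -> [L];
    split_every(Size, L) ->
        {H, T} = lists:split(Size, L),
        [H | split_every(Size, T)].

    %% Placeholder per-chunk work: just sort the chunk locally.
    process_chunk(Chunk) -> lists:sort(Chunk).

Because each reply is tagged with a unique reference, stray messages can't be mistaken for results, and re-running with a different N is just a matter of changing the first argument.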


A trickier way to do it is to benchmark a bunch of times, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pools go for preallocated resources, some for a resizable amount), and settle for what would be a one-size-fits-all solution.


But really, there is no easy 'optimal number of cores'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make use of most of them and saturate the cores anyway.
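If you would rather derive N from the machine than guess, the VM can report how many schedulers it is running (normally one per core):

    %% One scheduler normally maps to one core; a sane starting point for N.
    N = erlang:system_info(schedulers_online).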

-



  2. Parallel file reads are an interesting question. It may or may not be faster (as direct comments have mentioned), and it may only represent a speed-up if the work on each line is minimal enough that reading the file is the biggest cost.


The tricky bit is really that functions like pread/2-3 take a byte offset. Your question is worded such that you are worried about the lines of the file. The byte offsets you hand off to workers may therefore end up straddling a line. If your block boundary ends up at the word "my" in "this is my line\nhere it goes\n", one worker will see it has an incomplete line, while the other will report only "my line\n", missing the prior "this is".
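One common way for a worker to cope, sketched below under a couple of assumptions: a worker owns the lines that begin inside its byte range, Fd was opened with file:open(Path, [read, raw, binary]), and the 4096-byte extension step is arbitrary. The worker drops the partial line at the start of its range (its left neighbour completes that one) and reads past the end of its range to finish its own last line:

    -module(pread_sketch).
    -export([read_lines_in_range/3]).

    %% Assumes {Offset, Size} is a valid range within the file.
    read_lines_in_range(Fd, Offset, Size) ->
        {ok, Bin} = file:pread(Fd, Offset, Size),
        Kept = case Offset of
                   0 -> Bin;                          %% nothing precedes the first chunk
                   _ -> drop_partial_first_line(Bin)  %% left neighbour owns that line
               end,
        complete_last_line(Fd, Offset + byte_size(Bin), Kept).

    drop_partial_first_line(Bin) ->
        case binary:split(Bin, <<"\n">>) of
            [_Partial, Rest] -> Rest;
            [_NoNewline]     -> <<>>   %% whole chunk is mid-line; own nothing
        end.

    %% If the kept data does not end in a newline, keep reading forward
    %% until the line is finished (or eof is hit).
    complete_last_line(_Fd, _At, <<>>) -> <<>>;
    complete_last_line(Fd, At, Acc) ->
        case binary:last(Acc) of
            $\n -> Acc;
            _ ->
                case file:pread(Fd, At, 4096) of
                    eof -> Acc;   %% file ends without a trailing newline
                    {ok, More} ->
                        case binary:split(More, <<"\n">>) of
                            [Head, _] ->
                                <<Acc/binary, Head/binary, $\n>>;
                            [All] ->
                                complete_last_line(Fd, At + byte_size(All),
                                                   <<Acc/binary, All/binary>>)
                        end
                end
        end.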


Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, only to hand off bits of text for the workers to process; that process will then act as some sort of coordinator.


The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you instead have to trust both the starter and the workers to tell you "we're all out of work" as a distinct set of independent messages.
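A sketch of that coordinator shape; process_line/1 is a placeholder that just returns the line's byte size:

    -module(coord_sketch).
    -export([coordinate/1]).

    %% The coordinator owns the file: it reads lines itself, hands each
    %% line to a fresh process, and because it counted what it sent, it
    %% knows exactly how many replies to wait for before returning.
    coordinate(Path) ->
        {ok, Io} = file:open(Path, [read, raw, binary, {read_ahead, 64 * 1024}]),
        Sent = dispatch(Io, self(), 0),
        ok = file:close(Io),
        collect(Sent, []).

    dispatch(Io, Parent, N) ->
        case file:read_line(Io) of
            {ok, Line} ->
                spawn_link(fun() -> Parent ! {done, process_line(Line)} end),
                dispatch(Io, Parent, N + 1);
            eof ->
                N
        end.

    %% Count down the outstanding replies; sort once everything is in.
    collect(0, Acc) -> lists:sort(Acc);
    collect(N, Acc) ->
        receive {done, Result} -> collect(N - 1, [Result | Acc]) end.

    %% Placeholder per-line work.
    process_line(Line) -> byte_size(Line).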


In practice, you'll probably find that what helps the most is doing the operations that are kind to your hardware's file access patterns, more than worrying about "how many people can read the file at once". There's only one hard disk (or SSD) and all the data has to go through it anyway; parallelism may be limited in the end by the access there.

-



  3. Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.


A more realistic, still very performant program would use very few messages of a very small size.


The fun thing here is that your problem is inherently data-based. So there are a few things you can do:



  • make sure you read text in a binary format; large binaries (> 64 bytes) get allocated on a global binary heap, are shared around, and are GC'd with reference counting
  • hand in information on what needs to be done rather than the data for doing it; this one would need measuring, but the lead process could go over the file, note where lines end, and just hand byte offsets to the workers so they can go and read the file themselves (see the sketch after this list); do note that you'll end up reading the file twice, so if memory allocation is not your main overhead, this will likely be slower
  • make sure the file is read in raw or ram mode; other modes use a middle-man process to read and forward data (this is useful if you read files over the network between clustered Erlang nodes), while raw and ram modes give the file descriptor directly to the calling process and are a lot faster
  • first worry about writing a clear, readable and correct program. Only if it is too slow should you attempt to refactor and optimize it; you may very well find it good enough on the first try.
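For the second bullet, a sketch of how the lead process might note where lines end, producing {Offset, Length} pairs it could hand to workers (the pair format is just an illustration; workers would pread those ranges themselves):

    -module(offsets_sketch).
    -export([line_offsets/1]).

    %% Scan a binary once and return [{Offset, Length}] for each line,
    %% with the trailing newline excluded from Length.
    line_offsets(Bin) -> line_offsets(Bin, 0, []).

    line_offsets(Bin, From, Acc) when From >= byte_size(Bin) ->
        lists:reverse(Acc);
    line_offsets(Bin, From, Acc) ->
        Scope = {From, byte_size(Bin) - From},
        case binary:match(Bin, <<"\n">>, [{scope, Scope}]) of
            {Pos, 1} ->
                line_offsets(Bin, Pos + 1, [{From, Pos - From} | Acc]);
            nomatch ->
                %% last line has no trailing newline
                lists:reverse([{From, byte_size(Bin) - From} | Acc])
        end.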

I hope this helps.


P.S. You can try the really simple stuff at first:



  1. either:



  • read the whole file at once with {ok, Bin} = file:read_file(Path) and split lines (with binary:split(Bin, <<"\n">>, [global])),
  • use {ok, Io} = file:open(File, [read,ram]) and then use file:read_line(Io) on the file descriptor repeatedly
  • use {ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}]) and then use file:read_line(Io) on the file descriptor repeatedly


  2. call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)

  3. call lists:sort/1
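Glued together, those three steps might look like the following; line_size/1 stands in for whatever Function you would pass to pmap:

    -module(simple_pipeline).
    -export([run/1, line_size/1]).

    run(Path) ->
        {ok, Bin} = file:read_file(Path),                    %% step 1
        %% Note: a trailing empty binary appears if the file ends in \n.
        Lines = binary:split(Bin, <<"\n">>, [global]),
        Results = rpc:pmap({?MODULE, line_size}, [], Lines), %% step 2
        lists:sort(Results).                                 %% step 3

    %% Placeholder per-line work; rpc:pmap calls it once per line,
    %% each call in its own process.
    line_size(Line) -> byte_size(Line).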


Then from there you can refine each step if you identify them as problematic.
