When does fetch happen from Cassandra

Problem description

I have an application that submits a job to the Spark master. But when I check the IP address executing the job, it's displaying my application's IP and not the Spark worker's IP. So, from what I understand, a call on the RDD is what makes a Spark worker do the work.

But my question is this.

    CassandraSQLContext c = new CassandraSQLContext(sc);
    QueryExecution q = c.executeSql(cqlCommand); // ----- 1
    q.toRDD().count(); // ----- 2

I saw the worker doing something for 2 but nothing for 1.

So does this mean the fetch from Cassandra and the creation of the RDD from it in 1 are all done in the application?

If so, 2 does trigger a job on two workers. In that case, does it fetch again from Cassandra and process the count?

Can someone clarify this?

EDIT

1. Going by the answer provided, if the count call triggers the workers to function, then what is the use of executeSql creating an RDD locally? Does that create a dataset of the Cassandra data by querying? If that's the case, does the query to Cassandra happen twice?

2. If Spark automatically distributes the computation of 10 Cassandra partitions among 4 workers, who will aggregate the results? The master is just doing the distribution. So does it aggregate too?

3. If I don't cache the RDD and do another count operation, what will happen? Will Spark try to use the same worker that was used previously for a particular partition and append to the result RDD on that node? I think it has to query Cassandra to get this partition's data again. Can you provide some clarity on this?

4. If I cache my RDD, what happens? Is the RDD stored on the workers and used for all subsequent operations? In that case, how is this different from storing a dataset in memory ourselves and processing it? Let me know if I am right about this too.

Solution

Spark's loading and transformation of RDDs, such as the one defined by your CQL command, are lazily evaluated.

Actions trigger all of the precursor transformations to run. In your example, count() is an action, which is why the workers only become active at step 2.

The way Spark works internally is that it builds up a graph of transformations. When it needs to run an action, it breaks the graph up into separate sub-tasks that can be run by the individual workers.
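To make the sub-task idea concrete, here is a minimal plain-Java sketch of how a count over 10 partitions could be split across 4 workers. It is a toy model and not Spark code: `PartitionedCountDemo`, `distributedCount` and `samplePartitions` are hypothetical names, and a thread pool stands in for the executors. In Spark itself, count() works along similar lines: each task counts the rows of its own partition, and the driver (your application) adds up the partial counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: a thread pool plays the role of the Spark workers.
class PartitionedCountDemo {

    // "Driver" side: submit one counting task per partition, then sum the partial counts.
    static long distributedCount(List<List<String>> partitions, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> partition : partitions) {
                // "Worker" side: each task only looks at its own partition.
                partials.add(pool.submit(() -> (long) partition.size()));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // aggregation happens back on the driver
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("counting task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical sample data: 10 partitions holding 1, 2, ..., 10 rows.
    static List<List<String>> samplePartitions() {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < 10; p++) {
            List<String> rows = new ArrayList<>();
            for (int r = 0; r <= p; r++) {
                rows.add("p" + p + "-row" + r);
            }
            partitions.add(rows);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(distributedCount(samplePartitions(), 4)); // prints 55
    }
}
```

With 10 partitions and 4 workers, the partitions simply queue up for whichever worker thread is free; the total does not depend on which worker handled which partition.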

For a single action like count(), the data will only be fetched from Cassandra once and, where possible, each executor's part of the RDD is populated from the data that is local to its Cassandra node.

If you run another action on the RDD created from q, do not rely on it still being in memory: unless the RDD has been explicitly persisted, Spark will generally recompute it from its lineage, which means reading from Cassandra again. If you plan to re-use an RDD, call cache() or persist() on it so that its partitions are kept on the executors after the first action.
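The interaction between lazy evaluation, actions and caching can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: `LazyDataset` and `LazyFetchDemo` are hypothetical classes, not part of Spark or the Cassandra connector, and the supplier merely stands in for the Cassandra read. In this model, defining the dataset fetches nothing, each uncached action re-runs the source, and a dataset marked with cache() reads the source once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy model only: none of these classes belong to Spark.
class LazyFetchDemo {

    static class LazyDataset<T> {
        private final Supplier<List<T>> source; // stands in for the Cassandra read
        private boolean cacheRequested = false;
        private List<T> cached = null;

        LazyDataset(Supplier<List<T>> source) {
            this.source = source; // nothing is fetched when the dataset is defined
        }

        // Loosely mirrors the idea of RDD.cache(): it only marks the data to be kept.
        LazyDataset<T> cache() {
            cacheRequested = true;
            return this;
        }

        // An "action": this is the point where the source is actually read.
        long count() {
            if (cached != null) {
                return cached.size();
            }
            List<T> rows = source.get();
            if (cacheRequested) {
                cached = rows;
            }
            return rows.size();
        }
    }

    // Returns the number of source reads seen after each stage of the demo.
    static int[] run() {
        AtomicInteger fetches = new AtomicInteger();
        Supplier<List<String>> fakeCassandraRead = () -> {
            fetches.incrementAndGet();
            return Arrays.asList("row1", "row2", "row3");
        };

        LazyDataset<String> uncached = new LazyDataset<>(fakeCassandraRead);
        int afterDefinition = fetches.get(); // still 0: definition is lazy
        uncached.count();
        uncached.count();
        int afterUncachedCounts = fetches.get(); // 2: one read per action

        LazyDataset<String> cached = new LazyDataset<>(fakeCassandraRead).cache();
        cached.count();
        cached.count();
        int afterCachedCounts = fetches.get(); // 3: only one more read

        return new int[] {afterDefinition, afterUncachedCounts, afterCachedCounts};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // prints [0, 2, 3]
    }
}
```

The real system differs in important ways (cached partitions live on the executors, and they can be evicted and recomputed), so treat this purely as an illustration of when the read happens.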
