Filtering an RDD based on a condition and extracting the matched data in Spark (Python)

Problem description

I have data like this:

cl_id,cn_id,cn_value
10004,77173296,390.0
10004,77173299,376.0
10004,77173300,0.0
20005,77173296,0.0
20005,77173299,6.0
2005,77438800,2.0

cl_id values of interest: 10004, 20005

Filter by 10004:

10004,77173296,390.0
10004,77173299,376.0

Filter by 20005:

20005,77173296,0.0
20005,77173299,6.0

Now I want the returned RDD to look like this:

10004,cn_id,x1(77173296.value,77173299.value) ==> 10004,77173296,390.0,376.0
20005,cn_id,x1(77173296.value,77173299.value) ==> 20005,77173296,0.0,6.0

And I want to perform some operation on this return_RDD:

    def formula(a, b):
        # a = cn_value of cn_id 77173296, b = cn_value of cn_id 77173299
        try:
            return (float(a) / float(a + b)) * 100
        except ZeroDivisionError:
            return 0

    def cal_for(rdd_list):
        # Plan: split each line -> keep the cn_ids this formula needs
        # -> calculate the formula -> store in a separate RDD -> return that RDD
        rdd_list = rdd_list.map(lambda line: [f.strip() for f in line.split(',')])
        # Keep rows whose cn_id is either 77173296 or 77173299
        new_list = rdd_list.filter(lambda x: x[1] in ('77173296', '77173299'))
        new_list = new_list.map(lambda x: (x[0] + ', ' + x[1], float(x[2])))
        # Still to do: get the cn_values for cn_id 77173296 and cn_id 77173299
        # per cl_id (e.g. 10004) and apply formula(a, b) to them
        return new_list  # return or save cal_RDD

Solution

Instead of filtering the RDD twice, modifying and recombining the resulting RDDs, simply group by id, then map over the values to make any changes you need. If you want to further limit the results based on some criteria, then perform a filter while mapping.

I can't really give you a more precise answer because:

a) it doesn't look like you've really tried to implement this yet, and
b) I'm not entirely certain what you want.
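For illustration only, the group-then-map idea above might be sketched as follows. This is one reading of the question, not a definitive implementation. The names `WANTED`, `parse_line` and `share` are hypothetical helpers introduced here. The sketch assumes the input RDD holds plain text lines with the header row already removed, and that a missing `cn_id` should count as 0.0. It uses only the standard RDD methods `map`, `filter`, `groupByKey` and `mapValues`.

```python
# cn_ids used by the formula in the question (a, then b)
WANTED = ("77173296", "77173299")


def parse_line(line):
    """Turn 'cl_id,cn_id,cn_value' into (cl_id, (cn_id, value))."""
    cl_id, cn_id, value = (field.strip() for field in line.split(","))
    return cl_id, (cn_id, float(value))


def share(values_by_cn_id):
    """Apply a / (a + b) * 100; return 0 when a + b is zero.

    Assumption: a cn_id missing from the group counts as 0.0.
    """
    a = values_by_cn_id.get(WANTED[0], 0.0)
    b = values_by_cn_id.get(WANTED[1], 0.0)
    try:
        return a / (a + b) * 100
    except ZeroDivisionError:
        return 0


def cal_for(rdd):
    """Group rows by cl_id once, then map each group to the formula result.

    Assumption: rdd is an RDD of text lines without a header row.
    """
    return (
        rdd.map(parse_line)
        # filter while mapping: keep only the cn_ids the formula needs
        .filter(lambda kv: kv[1][0] in WANTED)
        .groupByKey()
        # each group is an iterable of (cn_id, value) pairs
        .mapValues(lambda pairs: share(dict(pairs)))
    )
```

Given an existing SparkContext `sc` and a list of data lines `lines`, something like `cal_for(sc.parallelize(lines)).collect()` would return one `(cl_id, percentage)` pair per `cl_id` that has at least one of the two wanted `cn_id` rows. Grouping once avoids filtering the RDD separately for each `cl_id` and recombining the results.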
