How to determine if an object is a valid key-value pair in PySpark


Problem Description





  1. If I have an RDD, how do I understand whether the data is in key:value format? Is there a way to find out - something like type(object) tells me an object's type. I tried print type(rdd.take(1)), but it just says <type 'list'>.
  2. Let's say I have data like (x,1),(x,2),(y,1),(y,3) and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a key have to be a single value? I noted that if I use reduceByKey and the sum function to get the data ((x,3),(y,4)) then it becomes much easier to define this data as a key-value pair.

Solution

Python is a dynamically typed language and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:

k, v = kv
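This rule can be checked in plain Python, without Spark. A minimal sketch (the candidate objects below are illustrative, not from the original answer):

```python
# Which objects satisfy the "unpackable as a pair" rule that
# PairRDD operations rely on? Anything that unpacks into exactly
# two values works; anything else raises ValueError.
candidates = [
    ("foo", 1),        # two-element tuple: the usual choice
    ["bar", 2],        # a two-element list also unpacks
    ("baz", 1, 2),     # three elements: not a valid pair
]

for kv in candidates:
    try:
        k, v = kv
        print("valid pair:", k, v)
    except ValueError:
        print("not a valid pair:", kv)
```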

Typically you would use a two-element tuple due to its semantics (immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:

key_value.py

class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v
    def __iter__(self):
        for x in [self.k, self.v]:
            yield x

from operator import add  # reduceByKey needs a binary function
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]

and make an arbitrary class behave like a key-value pair. So once again, if something can be correctly unpacked as a pair of objects then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well.
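For instance, a class exposing only the sequence protocol (a hypothetical `PairLike`, not from the original answer) also unpacks correctly, because Python's iteration falls back to `__getitem__` when `__iter__` is absent:

```python
class PairLike(object):
    """Unpackable via the sequence protocol (__len__/__getitem__)
    instead of __iter__."""
    def __init__(self, k, v):
        self._items = (k, v)

    def __len__(self):
        return 2

    def __getitem__(self, index):
        # IndexError past index 1 is what stops the implicit iteration.
        return self._items[index]

k, v = PairLike("foo", 1)
print(k, v)  # foo 1
```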

Probably the most elegant way to handle this is to use namedtuples, but unfortunately these have to be defined externally, like any other class.
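A minimal sketch of the namedtuple approach (the field names `k` and `v` are illustrative). A namedtuple unpacks like a plain tuple, so it is valid for PairRDD operations while also giving the fields readable names:

```python
from collections import namedtuple

# Must be defined at module level (externally), like any other class,
# so that it can be pickled and shipped to Spark workers.
KV = namedtuple("KV", ["k", "v"])

pair = KV("foo", 1)
k, v = pair          # unpacks like a regular two-element tuple
print(k, v)          # foo 1
print(pair.k)        # foo
```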

Also, type(rdd.take(1)) returns a list of length n, so its type will always be the same.
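A plain-Python illustration of why that check is uninformative, with a local list standing in for the result of `rdd.take(1)`:

```python
# take(n) always returns a list, regardless of what the RDD holds,
# so inspect an element of that list rather than the list itself.
sample = [("foo", 1)]            # what rdd.take(1) would return

print(type(sample) is list)      # True - tells you nothing about the data
print(type(sample[0]) is tuple)  # True - this is the actual record type
```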
