How to determine if object is a valid key-value pair in PySpark
Problem Description
- If I have an RDD, how do I tell whether the data is in key:value
format? Is there a way to find out, something like how
type(object) tells me an object's type. I tried
print type(rdd.take(1))
, but it just says <type 'list'>
. - Let's say I have data like
(x,1),(x,2),(y,1),(y,3)
and I use groupByKey
and get (x,(1,2)),(y,(1,3))
. Is there a way to define (1,2)
and (1,3)
as values where x and y are keys? Or does a key have to map to a single value? I noted that if I use reduceByKey
and the sum
function to get the data ((x,3),(y,4))
, then it becomes much easier to define this data as a key-value pair.
Python is a dynamically typed language, and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD
operations is that it can be unpacked as follows:
k, v = kv
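For instance, any object that unpacks into exactly two parts satisfies this. A minimal sketch in plain Python (not PySpark-specific, with made-up sample values):

```python
# A tuple and a list both unpack into two parts, so both qualify as "pairs".
pairs = [("foo", 1), ["bar", 2]]
for kv in pairs:
    k, v = kv
    print(k, v)

# An object with the wrong number of elements fails to unpack:
try:
    k, v = ("a", "b", "c")
except ValueError as e:
    print("not a valid pair:", e)
```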
Typically you would use a two-element tuple
due to its semantics (an immutable object of fixed size) and its similarity to Scala Product
classes. But this is just a convention, and nothing stops you from doing something like this:
key_value.py

class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        for x in [self.k, self.v]:
            yield x

from operator import add
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
and make an arbitrary class behave like a key-value pair. So once again, if something can be correctly unpacked into a pair of objects, then it is a valid key-value pair. Implementing the __len__
and __getitem__
magic methods should work as well.
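As a sketch of that sequence-protocol route (plain Python, class name hypothetical): a class exposing __len__ and integer-indexed __getitem__ is iterable via the legacy iteration protocol, so tuple unpacking works on it.

```python
class Pair:
    """Behaves like a two-element sequence, so `k, v = Pair(...)` works."""
    def __init__(self, k, v):
        self._items = (k, v)

    def __len__(self):
        return 2

    def __getitem__(self, i):
        # Raises IndexError past index 1, which terminates unpacking cleanly.
        return self._items[i]

k, v = Pair("foo", 1)
print(k, v)  # foo 1
```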
Probably the most elegant way to handle this is to use namedtuples
, but unfortunately these have to be defined externally, like any other class.
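For example (a minimal sketch with a made-up type name; defining the namedtuple at module top level matters because Spark pickles data for the workers, and pickle must be able to import the class):

```python
from collections import namedtuple

# Define at module level, not inside a function, so pickle can locate it.
KV = namedtuple("KV", ["k", "v"])

pair = KV("foo", 1)
k, v = pair            # unpacks like a regular tuple
print(k, v)            # foo 1
print(pair.k, pair.v)  # named field access as a bonus
```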
Also, type(rdd.take(1))
returns a list
of length n
, so its type will always be the same.
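In other words, take(n) always hands back a plain Python list, so checking its type tells you nothing about the elements; inspect an element instead, e.g. type(rdd.take(1)[0]). A plain-Python analogue (hypothetical data, no Spark required):

```python
data = [("x", 1), ("y", 2), ("y", 3)]

# Analogous to rdd.take(1): the container is always a list...
first = data[:1]
print(type(first))     # <class 'list'>

# ...so look at the element itself to learn the record type:
print(type(first[0]))  # <class 'tuple'>
```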