How to divide RDD data into two in Spark?
Question
I have data in a Spark RDD and I want to divide it into two parts with a scale such as 0.7. For example, if the RDD looks like this:
[1,2,3,4,5,6,7,8,9,10]
I want to divide it into rdd1:

[1,2,3,4,5,6,7]

and rdd2:

[8,9,10]

with the scale 0.7. rdd1 and rdd2 should be random every time. I tried this way:
import random

scale = 0.7  # fraction of elements to sample into rdd1
seed = random.randint(0, 10000)
rdd1 = data.sample(False, scale, seed)  # sample without replacement
rdd2 = data.subtract(rdd1)              # everything not sampled
It works sometimes, but when my data contains dict I run into problems. For example, with data as follows:
[{1:2},{3:1},{5:4,2:6}]
I got:

TypeError: unhashable type: 'dict'
Answer
Both RDDs
rdd = sc.parallelize(range(10))
test, train = rdd.randomSplit(weights=[0.3, 0.7], seed=1)
test.collect()
## [4, 7, 8]
train.collect()
## [0, 1, 2, 3, 5, 6, 9]
and DataFrames
df = rdd.map(lambda x: (x, )).toDF(["x"])
test, train = df.randomSplit(weights=[0.3, 0.7])
provide a randomSplit method which can be used here.
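Since randomSplit only draws a random number for each element and never hashes the elements themselves, it also works on the dict data from the question. A minimal sketch, assuming sc is an active SparkContext:

data = sc.parallelize([{1: 2}, {3: 1}, {5: 4, 2: 6}])

rdd1, rdd2 = data.randomSplit(weights=[0.7, 0.3], seed=42)
rdd1.collect()  # roughly 70% of the elements
rdd2.collect()  # the rest; no hashing is involved, so dicts are fine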
Notes:
- randomSplit is expressed using a single filter for each output RDD. In general, it is not possible to yield multiple RDDs from a single Spark transformation; see https://stackoverflow.com/a/32971246/1560062 for details. A simplified sketch of the filter idea follows after these notes.
- You cannot use subtract with dictionaries, because internally it is expressed as a cogroup and therefore requires the objects to be hashable. See also A list as a key for PySpark's reduceByKey.
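As a simplified illustration of the first note (this is not Spark's actual implementation; bounded_sample is a hypothetical helper introduced only for this sketch), each output can be expressed as an independent pass that replays the same seeded random stream and keeps only the elements whose draw falls into its range:

import random

def bounded_sample(lower, upper, seed):
    # Keep an element when its random draw falls into [lower, upper).
    def f(index, iterator):
        rng = random.Random(seed + index)  # one deterministic RNG per partition
        for x in iterator:
            if lower <= rng.random() < upper:
                yield x
    return f

rdd = sc.parallelize(range(10))
seed = 1
test = rdd.mapPartitionsWithIndex(bounded_sample(0.0, 0.3, seed))
train = rdd.mapPartitionsWithIndex(bounded_sample(0.3, 1.0, seed))

Because both passes draw the same per-partition random sequence, the two outputs are disjoint and together cover the whole RDD, which is also why the split is reproducible for a fixed seed.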