How to divide RDD data into two in Spark?
Question
I have data in a Spark RDD and I want to divide it into two parts with a scale such as 0.7. For example, if the RDD looks like this:
[1,2,3,4,5,6,7,8,9,10]
I want to divide it into rdd1:

[1,2,3,4,5,6,7]

and rdd2:

[8,9,10]

with the scale 0.7. rdd1 and rdd2 should be random every time. I tried this way:
import random

scale = 0.7  # fraction of elements to sample into rdd1
seed = random.randint(0, 10000)
rdd1 = data.sample(False, scale, seed)  # sample without replacement
rdd2 = data.subtract(rdd1)              # everything not sampled
It works sometimes, but when my data contains dict I run into problems. For example, with data as follows:
[{1:2},{3:1},{5:4,2:6}]
I got:

TypeError: unhashable type: 'dict'
Answer
Both RDDs
rdd = sc.parallelize(range(10))
test, train = rdd.randomSplit(weights=[0.3, 0.7], seed=1)
test.collect()
## [4, 7, 8]
train.collect()
## [0, 1, 2, 3, 5, 6, 9]
and DataFrames
df = rdd.map(lambda x: (x, )).toDF(["x"])
test, train = df.randomSplit(weights=[0.3, 0.7])
provide a randomSplit method which can be used here.
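Since randomSplit only draws a random number for each element and never hashes the elements themselves, it also works on the dict data from the question. A minimal sketch, assuming sc is an active SparkContext:

data = sc.parallelize([{1: 2}, {3: 1}, {5: 4, 2: 6}])

rdd1, rdd2 = data.randomSplit(weights=[0.7, 0.3], seed=42)
rdd1.collect()  # roughly 70% of the elements
rdd2.collect()  # the rest; no hashing is involved, so dicts are fine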
Notes:
- randomSplit is expressed using a single filter for each output RDD. In general, it is not possible to yield multiple RDDs from a single Spark transformation; see https://stackoverflow.com/a/32971246/1560062 for details. A simplified sketch of the filter idea follows after these notes.
- You cannot use subtract with dictionaries, because internally it is expressed as a cogroup and therefore requires the objects to be hashable. See also A list as a key for PySpark's reduceByKey.
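As a simplified illustration of the first note (this is not Spark's actual implementation; bounded_sample is a hypothetical helper introduced only for this sketch), each output can be expressed as an independent pass that replays the same seeded random stream and keeps only the elements whose draw falls into its range:

import random

def bounded_sample(lower, upper, seed):
    # Keep an element when its random draw falls into [lower, upper).
    def f(index, iterator):
        rng = random.Random(seed + index)  # one deterministic RNG per partition
        for x in iterator:
            if lower <= rng.random() < upper:
                yield x
    return f

rdd = sc.parallelize(range(10))
seed = 1
test = rdd.mapPartitionsWithIndex(bounded_sample(0.0, 0.3, seed))
train = rdd.mapPartitionsWithIndex(bounded_sample(0.3, 1.0, seed))

Because both passes draw the same per-partition random sequence, the two outputs are disjoint and together cover the whole RDD, which is also why the split is reproducible for a fixed seed.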