Parallelizing independent actions on the same DataFrame in Spark

Question

Let's say I have a Spark DataFrame with the following schema:

root
 |-- prob: Double
 |-- word: String

I'd like to randomly select two different words from this DataFrame, and I'd like to perform this action X times, so that at the end I have X tuples of randomly selected words. Every selection is independent of the others. How do I accomplish this?

Example:

Suppose this is my dataset:

[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]

where the first element of each pair is prob and the second is word. For X = 5 the output could be:

1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red

As the selections are independent, results 2 and 3 can be identical, and that's fine. Within a single tuple, however, a word may appear only once.

Recommended answer

1) You can use one of these DataFrame methods:

  • randomSplit(weights: Array[Double], seed: Long)
  • randomSplitAsList(weights: Array[Double], seed: Long)
  • sample(withReplacement: Boolean, fraction: Double)

Then take the first two rows.
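Option 1 can be sketched roughly as follows. The object name `SampleThenTake`, the helper `pickTwo`, and the local `SparkSession` setup are hypothetical names introduced for illustration; only `sample` and `take` come from the Spark API. Note that `sample` keeps an approximate fraction of the rows, so with a small fraction the result may hold fewer than two rows.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical helper object, for illustration only.
object SampleThenTake {
  // Samples without replacement, then keeps the first two rows of the sample.
  // take(2) returns rows in partition order, so this is a rough draw, not a
  // strictly uniform choice of two rows.
  def pickTwo(df: DataFrame, fraction: Double, seed: Long): Array[Row] =
    df.sample(withReplacement = false, fraction = fraction, seed = seed).take(2)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-then-take")
      .getOrCreate()
    import spark.implicits._

    // The dataset from the question.
    val df = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))
      .toDF("prob", "word")

    // May print fewer than two rows if the sample happens to be small.
    pickTwo(df, fraction = 0.9, seed = 42L).foreach(println)

    spark.stop()
  }
}
```

Passing a different seed on each call gives a different draw; reusing the same seed on the same data is intended to make a draw repeatable.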

2) Shuffle the rows and take the first two of them.

import org.apache.spark.sql.functions.rand
// n is the number of rows to keep; use 2 to pick a pair of words
dataset.orderBy(rand()).limit(n)
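To get X independent pairs, the shuffle-and-limit step can simply be run X times. The following is a minimal sketch under a few assumptions: the names `RepeatedShuffle` and `drawPairs` are hypothetical, the DataFrame has at least two rows, and the values in the `word` column are unique (so two different rows mean two different words).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.rand

// Hypothetical helper object, for illustration only.
object RepeatedShuffle {
  // Runs orderBy(rand()).limit(2) x times. Each call to rand() without a seed
  // picks a fresh random seed, so every draw is independent of the others.
  def drawPairs(df: DataFrame, x: Int): Seq[(String, String)] = {
    // Cache the word column so the source is not recomputed on every draw.
    val words = df.select("word").cache()
    val pairs = (1 to x).map { _ =>
      val picked = words.orderBy(rand()).limit(2).collect().map(_.getString(0))
      (picked(0), picked(1)) // assumes at least two rows exist
    }
    words.unpersist()
    pairs
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repeated-shuffle")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))
      .toDF("prob", "word")

    // Prints five pairs; the same pair may appear more than once.
    drawPairs(df, x = 5).foreach { case (a, b) => println(s"$a, $b") }

    spark.stop()
  }
}
```

Each draw is a separate Spark job that sorts the whole DataFrame, so this is simple rather than cheap; for large X or large data it may be worth measuring against the other options.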

3) Or you can use the takeSample method of the RDD and then convert the result to a DataFrame:

def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

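val rows = dataframe.rdd.takeSample(withReplacement = false, num = 2) // Array[Row], two distinct rows
spark.createDataFrame(spark.sparkContext.parallelize(rows), dataframe.schema) // spark is your SparkSession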
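Since the X draws are independent, they can also be submitted to Spark concurrently from separate threads. The sketch below does this with Scala Futures around `takeSample`; it is one possible approach, not something the answer above prescribes. The names `ParallelTakeSample` and `drawPairs` are hypothetical, the global execution context is an assumption made for brevity, and the DataFrame is assumed to have at least two rows with unique words.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical helper object, for illustration only.
object ParallelTakeSample {
  // Starts x takeSample jobs from separate threads and waits for all of them.
  // takeSample uses a fresh random seed by default, so the draws are independent.
  def drawPairs(df: DataFrame, x: Int): Seq[(String, String)] = {
    // Cache the words so concurrent jobs do not each recompute the source.
    val words = df.select("word").rdd.map(_.getString(0)).cache()
    val draws = (1 to x).map { _ =>
      Future {
        // Without replacement, the two sampled elements come from different rows.
        val picked = words.takeSample(withReplacement = false, num = 2)
        (picked(0), picked(1)) // assumes at least two rows exist
      }
    }
    val result = Await.result(Future.sequence(draws), Duration.Inf)
    words.unpersist()
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parallel-take-sample")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))
      .toDF("prob", "word")

    // Prints five pairs; the same pair may appear more than once.
    drawPairs(df, x = 5).foreach { case (a, b) => println(s"$a, $b") }

    spark.stop()
  }
}
```

How much the jobs actually overlap depends on available cores and the scheduler configuration, so any speedup should be measured rather than assumed.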
