flatMap on a DataFrame

Question

What is the best way to perform a flatMap on a DataFrame in Spark? From searching around and doing some testing, I have come up with two different approaches. Both have drawbacks, so I suspect there is a better or easier way to do it.

The first way I have found is to convert the DataFrame into an RDD and then back again:

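import org.apache.spark.sql.Row

val map = Map("a" -> List("c","d","e"), "b" -> List("f","g","h"))
val df = List(("a", 1.0), ("b", 2.0)).toDF("x", "y")

// Emit one output Row per value that x maps to, repeating y each time
val rdd = df.rdd.flatMap{ row =>
  val x = row.getAs[String]("x")
  val y = row.getAs[Double]("y")
  for(v <- map(x)) yield Row(v,y)
}
// Reusing df.schema works here because the output columns keep the same names and types
val df2 = spark.createDataFrame(rdd, df.schema)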

The second approach is to create a Dataset before using flatMap (with the same variables as above) and then convert back:

val ds = df.as[(String, Double)].flatMap{
  case (x, y) => for(v <- map(x)) yield (v,y)
}.toDF("x", "y")

Both approaches work quite well when the number of columns is small, but I have many more than 2 columns. Is there a better way to solve this problem? Preferably one where no conversion is necessary.
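For reference, the expansion both snippets are aiming for can be illustrated with plain Scala collections. This is only a sketch of the intended semantics, not Spark code; the name `rows` is introduced here for illustration and stands in for the contents of `df`.

```scala
// Plain Scala collections, no Spark: shows the row expansion the question is after.
val map = Map("a" -> List("c", "d", "e"), "b" -> List("f", "g", "h"))
val rows = List(("a", 1.0), ("b", 2.0)) // hypothetical stand-in for the rows of df

// Each (x, y) row becomes one row per value that x maps to, with y repeated.
val expanded = rows.flatMap { case (x, y) => map(x).map(v => (v, y)) }

println(expanded)
// List((c,1.0), (d,1.0), (e,1.0), (f,2.0), (g,2.0), (h,2.0))
```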

Recommended answer

You can create a second DataFrame from your map:

val mapDF = Map("a" -> List("c","d","e"), "b" -> List("f","g","h")).toList.toDF("key", "value")

Then perform the join and apply the explode function:

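import org.apache.spark.sql.functions.explode

// The inner join keeps only rows whose x has an entry in the map
val joinedDF = df.join(mapDF, df("x") === mapDF("key"), "inner")
  .select("value", "y")
  .withColumn("value", explode($"value"))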

You can then inspect the result:

joinedDF.show()
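Since the question mentions many more than two columns, the same join-and-explode idea could be written so that the other columns are carried through without listing them by hand. The following is a minimal sketch under stated assumptions, not a definitive implementation: the extra column `z`, the local master setting, the app name, and the names `passThrough` and `result` are all hypothetical and introduced only for illustration. It is written in spark-shell or script style, like the snippets above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("flatmap-by-join")
  .getOrCreate()
import spark.implicits._

// "z" is a hypothetical extra column standing in for the many columns in the question.
val df = List(("a", 1.0, "p"), ("b", 2.0, "q")).toDF("x", "y", "z")
val mapDF = Map("a" -> List("c", "d", "e"), "b" -> List("f", "g", "h"))
  .toList.toDF("key", "value")

// Columns to carry through unchanged: everything except the lookup column "x".
val passThrough = df.columns.filter(_ != "x").map(col)

// Join on the key, then replace x with one row per element of the matched list.
val result = df
  .join(mapDF, df("x") === mapDF("key"), "inner")
  .select(explode(col("value")).as("x") +: passThrough: _*)

result.show()
```

Note one behavioral difference from the flatMap versions in the question: with an inner join, rows whose `x` has no entry in the map are silently dropped, whereas `map(x)` would throw on a missing key.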
