How to add a new column with random values to an existing DataFrame in Scala

Views: 93 | Published: 2021/11/12 5:33:06 | Tags: scala apache-spark random apache-spark-sql user-defined-functions


Problem description

I have a DataFrame loaded from a Parquet file, and I have to add a new column with some random data, but the random values need to be different from each other. This is my actual code; the current version of Spark is 1.5.1-cdh-5.5.2:

val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686 
mydf.cache

val r = scala.util.Random
import org.apache.spark.sql.functions.udf
def myNextPositiveNumber :String = { (r.nextInt(Integer.MAX_VALUE) + 1 ).toString.concat("D")}
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))

With this code, I get this data:

scala> myNewDF.select("myNewColumn").show(10,false)
+-----------+
|myNewColumn|
+-----------+
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
+-----------+

It looks like the UDF myNextPositiveNumber is invoked only once, doesn't it?

Update: confirmed, there is only one distinct value:

scala> myNewDF.select("myNewColumn").distinct.show(50,false)
17/02/21 13:23:11 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
...

+-----------+                                                                   
|myNewColumn|
+-----------+
|889488717D |
+-----------+

What am I doing wrong?
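One way to read the output above, sketched in plain Scala without Spark: in the code shown, `lit(myNextPositiveNumber)` calls the method once on the driver and wraps the resulting string as a constant, while `myFunction` is defined but never used. The object name, the fixed seed, and the helper names `constantColumn` and `perRowColumn` are assumptions made for this illustration only.

```scala
import scala.util.Random

object EagerVsPerRow {
  // Same generator as in the question; the fixed seed is an assumption for repeatability.
  val r = new Random(42L)

  def myNextPositiveNumber: String =
    (r.nextInt(Integer.MAX_VALUE) + 1).toString.concat("D")

  // Analogue of lit(myNextPositiveNumber): the method runs once,
  // and the single result is reused for every row.
  def constantColumn(rows: Int): Seq[String] = {
    val value = myNextPositiveNumber
    Seq.fill(rows)(value)
  }

  // Analogue of a per-row invocation: Seq.fill takes its element by name,
  // so the method runs once for each row.
  def perRowColumn(rows: Int): Seq[String] =
    Seq.fill(rows)(myNextPositiveNumber)

  def main(args: Array[String]): Unit = {
    println(constantColumn(5).distinct.size) // 1, as in the question's output
    println(perRowColumn(5))                 // five separately generated values
  }
}
```

Note that even with per-row invocation, independently drawn random integers are not guaranteed to be distinct across tens of millions of rows.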

Update 2: finally, with the help of @user6910411, I have this code:

val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686 
mydf.cache

val r = scala.util.Random

import org.apache.spark.sql.functions.udf

val accum = sc.accumulator(1)

def myNextPositiveNumber():String = {
   accum+=1
   accum.value.toString.concat("D")
}

val myFunction = udf(myNextPositiveNumber _)

val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))

myNewDF.select("myNewColumn").count

// 63385686

Update 3

The actual code generates data like this:

scala> mydf.select("myNewColumn").show(5,false)
17/02/22 11:01:57 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+
|myNewColumn|
+-----------+
|2D         |
|2D         |
|2D         |
|2D         |
|2D         |
+-----------+
only showing top 5 rows

It looks like the UDF is invoked only once, doesn't it? I need a new random element in that column for each row.

Update 4 @user6910411

I now have this code, which increases the id but does not concatenate the final character, which is weird. This is my code:

import org.apache.spark.sql.functions.udf


val mydf = sqlContext.read.parquet("some.parquet")

mydf.cache

def myNextPositiveNumber():String = monotonically_increasing_id().toString().concat("D")

val myFunction = udf(myNextPositiveNumber _)

val myNewDF = mydf.withColumn("myNewColumn",expr(myNextPositiveNumber))

scala> myNewDF.select("myNewColumn").show(5,false)
17/02/22 12:00:02 WARN Executor: 1 block locks were not released by TID = 1:
[rdd_4_0]
+-----------+
|myNewColumn|
+-----------+
|0          |
|1          |
|2          |
|3          |
|4          |
+-----------+

I need something like this:

+-----------+
|myNewColumn|
+-----------+
|1D         |
|2D         |
|3D         |
|4D         |
+-----------+
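A minimal sketch of one way to get output of this shape, assuming a Spark version in which `monotonically_increasing_id`, `concat`, and `lit` are available in `org.apache.spark.sql.functions` (the Update 4 code already calls the first of these). The suffix is appended with a Column expression, not with `String.concat` on the Scala side. The names `IdColumnSketch` and `withIdColumn` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

object IdColumnSketch {
  // Build the new value entirely from Column expressions so that Spark
  // evaluates it per row: (id + 1) cast to string, followed by "D".
  def withIdColumn(df: DataFrame): DataFrame =
    df.withColumn(
      "myNewColumn",
      concat((monotonically_increasing_id() + 1).cast("string"), lit("D"))
    )
}
```

Usage would be along the lines of `IdColumnSketch.withIdColumn(mydf).select("myNewColumn").show(5, false)`. The generated ids are unique and increasing but not consecutive across partitions, and they are not random; if random values are needed, the `rand()` column function could be substituted, at the cost of losing any uniqueness guarantee.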
