Spark DataSet: efficiently get the length/size of an entire row
Problem description
I'm working with data sets of different sizes, each with a dynamic number of columns. For my application, I need to know the total character length of each row in order to estimate the whole row size in bytes or kilobytes.
The resulting row size (in KB) will be written to a new column.
private void writeMyData(Dataset<Row> dataSet){
    // Collect all columns of the Dataset into a Column[].
    Column[] columns = Arrays.stream(dataSet.columns())
            .map(col -> functions.col(col))
            .toArray(Column[]::new);
    // Note: the first argument of concat_ws is the separator string; here the
    // name of the 4th column, dataSet.columns()[3], is used as that separator.
    dataSet.withColumn("marker", functions.length(functions.concat_ws(dataSet.columns()[3], columns)))
            .write().partitionBy(hivePartitionColumn)
            .option("header", "true")
            .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);
}
Since none of the methods in org.apache.spark.sql.functions returns a Column[], I had to use dataSet.columns() and collect the columns myself. But nesting functions.method calls like this every time does not seem efficient. I would rather have a single function that takes a Column[] and returns the total length of the columns, instead of nesting operations; a rough sketch of what such a helper might look like is shown below.
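A minimal sketch of that kind of helper, assuming every column can be cast to a string and summing the per-column lengths instead of concatenating them first (the name totalLength is made up for illustration, not an existing Spark API):

import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Hypothetical helper: builds one Column expression that adds up the string
// length of every column; coalesce treats NULL values as length 0.
private static Column totalLength(Column[] columns) {
    Column total = functions.lit(0);
    for (Column c : columns) {
        total = total.plus(
                functions.coalesce(functions.length(c.cast("string")), functions.lit(0)));
    }
    return total;
}

// Usage: dataSet.withColumn("marker", totalLength(columns))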
- Is there a way to do this with a UDF, or is there an existing function for this kind of operation?
- How bad is this kind of solution?
A Java solution is preferred.
A nice solution with a Spark DataFrame UDF that I used to get the length in bytes, which works better for my case:
// UDF that returns the byte length of a String.
static UDF1<String, Integer> BytesSize = new UDF1<String, Integer>() {
    public Integer call(final String line) throws Exception {
        return line.getBytes().length;
    }
};

private void saveIt(){
    // Register the UDF so it can be called by name.
    sparkSession.udf().register("BytesSize", BytesSize, DataTypes.IntegerType);
    // Concatenate all columns with "," and compute the byte length of the result.
    dfToWrite.withColumn("fullLineBytesSize", functions.callUDF("BytesSize", functions.concat_ws(",", columns)))
            .write().partitionBy(hivePartitionColumn)
            .option("header", "true")
            .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);
}
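Note that `columns` is not defined inside saveIt(); presumably it is built from the DataFrame's column names in the same way as in the question, for example (an assumption, shown only for completeness):

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Assumption: columns holds every column of dfToWrite, as in the question.
Column[] columns = Arrays.stream(dfToWrite.columns())
        .map(functions::col)
        .toArray(Column[]::new);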