Spark DataSet efficiently get length size of entire row


Problem Description


I'm working with datasets of different sizes, each with a dynamic number of columns. For my application I need to know the total character length of each row in order to estimate the row size in bytes or kilobytes.

The resulting row size (in KB) will be written to a new column.

private void writeMyData(Dataset<Row> dataSet){

        // Wrap every column name of the dataset in a Column object.
        Column[] columns = Arrays.stream(dataSet.columns()).map(col -> functions.col(col)).toArray(Column[]::new);

        // concat_ws takes the separator as its first argument, followed by the columns to join;
        // length() then counts the characters of the concatenated row.
        dataSet.withColumn("marker", functions.length(functions.concat_ws(dataSet.columns()[3], columns))).write().partitionBy(hivePartitionColumn)
                .option("header", "true")
                .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);

}

Since none of the methods in org.apache.spark.sql.functions return Column[], I had to take dataSet.columns() and collect it into a Column[] myself.

But using nested functions.method operations each time doesn't seem efficient.

I would rather have a single function that takes Column[] and returns the total length of the columns, instead of nesting operations.
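For illustration, the kind of helper I have in mind might look like the sketch below (totalLength is a made-up name, and summing per-column lengths ignores separators, so it is not exactly equivalent to the concat_ws approach):

private static Column totalLength(Column[] columns) {
    // Sum functions.length(col) over all columns; null values in a column would need
    // coalesce handling, which is omitted in this sketch.
    return Arrays.stream(columns)
            .map(functions::length)
            .reduce(functions.lit(0), Column::plus);
}

// hypothetical usage: dataSet.withColumn("marker", totalLength(columns))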

  1. Is there a way you can help me with a UDF for this kind of operation? Or is there an existing function for it?
  2. How bad is this kind of solution?

A Java solution is preferred.

Solution

A nice solution with a Spark DataFrame UDF that I have used to get the length in bytes, which is better for my case:

static UDF1<String, Integer> BytesSize = new UDF1<String, Integer>() {
    // Return the number of bytes of the concatenated row string.
    public Integer call(final String line) throws Exception {
        return line.getBytes().length;
    }
};

private void saveIt(){

    // Register the UDF so it can be invoked by name via callUDF.
    sparkSession.udf().register("BytesSize", BytesSize, DataTypes.IntegerType);

    // Concatenate all columns with "," and apply the UDF to get the byte size of each row.
    dfToWrite.withColumn("fullLineBytesSize", callUDF("BytesSize", functions.concat_ws(",", columns)))
            .write().partitionBy(hivePartitionColumn)
            .option("header", "true")
            .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);
}
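
Since the original requirement is to estimate the size in KB, the byte count produced by the UDF can be converted afterwards. A minimal sketch, assuming the column names used above (the names here are only illustrative):

Dataset<Row> withBytes = dfToWrite.withColumn("fullLineBytesSize",
        callUDF("BytesSize", functions.concat_ws(",", columns)));

// Divide the byte count by 1024 to get an approximate size in kilobytes.
Dataset<Row> withKb = withBytes.withColumn("fullLineKBytesSize",
        functions.col("fullLineBytesSize").divide(1024));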
