Spark Java: Escape dot in column names for VectorAssembler

Question

I have a Dataset in which some column names contain dots. The problem arises with VectorAssembler, which does not seem to accept such names. I tried to escape the dots in many ways, but nothing changed.

String[] expincols = newfilenameavgpeaks.columns();

VectorAssembler assemblerexp = new VectorAssembler()
                    .setInputCols(expincols)
                    .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);

I have wrapped every element of expincols in "`", "``", "```", "````", "'", '"', etc., but none of them worked. I also tried the same wrappers in the column names of newfilenameavgpeaks itself, again with no effect. Any ideas on how to escape the dots?

Recommended answer

If the dataset contains a column a.b, you can still use df.col("`a.b`") to select the column with a . in its name. This works because Dataset.col resolves the column name and can handle the backticks.

VectorAssembler.transform, however, takes the schema of the supplied dataset and uses this StructType to look up the column names in VectorAssembler.transformSchema. The apply method of StructType contains no logic to handle backticks and throws an IllegalArgumentException if a column name does not match exactly. A name wrapped in backticks therefore never matches the stored name a.b.
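The difference between the two lookups can be illustrated with a toy model in plain Java. This is only a sketch of the behaviour described above, not Spark code: ToySchema, exactLookup and quotedLookup are hypothetical names, and the model ignores everything else Spark does when resolving a column (nested fields, case sensitivity, and so on).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these names are Spark APIs.
final class ToySchema {
    // Field name -> type name, kept in insertion order.
    private final Map<String, String> fields = new LinkedHashMap<>();

    ToySchema add(String name, String type) {
        fields.put(name, type);
        return this;
    }

    // Exact-match lookup, standing in for the schema lookup the answer describes:
    // the name must equal the stored field name character for character.
    String exactLookup(String name) {
        String type = fields.get(name);
        if (type == null) {
            throw new IllegalArgumentException("Field \"" + name + "\" does not exist.");
        }
        return type;
    }

    // Backtick-aware lookup, standing in for a resolver that understands quoting:
    // one surrounding pair of backticks is stripped before the exact lookup.
    String quotedLookup(String name) {
        boolean quoted = name.length() >= 2 && name.startsWith("`") && name.endsWith("`");
        return exactLookup(quoted ? name.substring(1, name.length() - 1) : name);
    }
}
```

With a schema containing the field a.b, quotedLookup("`a.b`") finds the field, while exactLookup("`a.b`") throws, because the backticks are treated as part of the name.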

Therefore the only option is to rename the columns before supplying them to the VectorAssembler:

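import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> newfilenameavgpeaks = ...

// Replace the dots in every column name so the names match the schema exactly.
for (String col : newfilenameavgpeaks.columns()) {
    newfilenameavgpeaks = newfilenameavgpeaks
            .withColumnRenamed(col, col.replace('.', '_'));
}

// The renamed columns can now be passed to VectorAssembler without escaping.
VectorAssembler assemblerexp = new VectorAssembler()
        .setInputCols(newfilenameavgpeaks.columns())
        .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);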
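One caveat with replacing '.' by '_': if the dataset already has a column such as a_b next to a.b, both end up with the same name after renaming. The following is a minimal sketch of a check that could run before the rename loop. It is plain Java with no Spark dependency, and ColumnNames and sanitize are hypothetical helper names, not part of any library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Spark API.
final class ColumnNames {
    // Maps each original column name to a dot-free name, in the original order.
    // Throws if two different columns would end up with the same new name.
    static Map<String, String> sanitize(String[] columns) {
        Map<String, String> renamed = new LinkedHashMap<>(); // original -> sanitized
        Map<String, String> seen = new LinkedHashMap<>();    // sanitized -> original
        for (String original : columns) {
            String safe = original.replace('.', '_');
            String previous = seen.put(safe, original);
            if (previous != null) {
                throw new IllegalArgumentException(
                        "Columns \"" + previous + "\" and \"" + original
                                + "\" both map to \"" + safe + "\"");
            }
            renamed.put(original, safe);
        }
        return renamed;
    }
}
```

The returned map could then drive the renaming, for example by calling withColumnRenamed(entry.getKey(), entry.getValue()) for each entry, so that a collision is reported before any column is renamed.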
