scala - how to substring column names after the last dot?


Question

After exploding a nested structure, I have a DataFrame with column names like this:

sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3

When I run a select, I get an error:

cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]

How should I select from the DataFrame so that the column names are resolved correctly?

I tried the following. The substrings after the last dot are extracted successfully, but I also have columns without dots, such as date, and their names are removed completely.

var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}

I want to be left with just metric1, metric2, metric3.

Recommended answer

You can use backticks to select columns whose names include periods.

val df = (1 to 1000).toDF("column.a.b")

df.printSchema
// root
//  |-- column.a.b: integer (nullable = false)

df.select("`column.a.b`")
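When there are many dotted columns, typing the backticks by hand gets tedious. A minimal sketch of a quoting helper follows; `quoted` is a hypothetical name, not a Spark function, and it assumes the column names themselves contain no backticks.

```scala
// Hypothetical helper (not part of Spark): wrap a column name in backticks so
// the dots are read as part of the name instead of as struct field access.
// Assumption: the name itself contains no backtick characters.
def quoted(name: String): String = "`" + name + "`"

// Column names taken from the question, plus a dot-free one.
val names = Seq("sales_data.metric1", "sales_data.type.metric2", "date")
val quotedNames = names.map(quoted)
```

With a DataFrame such as the question's salesDf, the quoted names could then be passed to select, for example salesDf.select(quotedNames.map(org.apache.spark.sql.functions.col): _*).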

You can also rename them easily like this. Starting from your current DataFrame, the fold renames one column per step and returns the final result.

val df2 = df.columns.foldLeft(df)(
    (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)

Getting the last component

To rename each column to just the last component of its name, this regex will work. Names without a dot do not match the pattern, so they are left unchanged:

val df2 = df.columns.foldLeft(df)(
    (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
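For comparison, the same renaming can be sketched with plain string operations instead of a regex. The attempt in the question loses names like date because Commons Lang's StringUtils.substringAfterLast returns an empty string when the separator is not found. The helper name `lastComponent` and the duplicate check are illustrative additions, not part of the original answer.

```scala
// Hypothetical helper: keep only the text after the last dot.
// lastIndexOf returns -1 when there is no dot, so substring(0) yields the
// whole name and dot-free columns such as "date" stay unchanged.
def lastComponent(name: String): String =
  name.substring(name.lastIndexOf('.') + 1)

// Column names from the question, plus the dot-free "date".
val oldNames = Seq("date", "sales_data.metric1", "sales_data.type.metric2", "sales_data.type3.metric3")
val newNames = oldNames.map(lastComponent)

// Shortened names can collide (for example "a.id" and "b.id" both become "id"),
// so collect any name that occurs more than once before renaming.
val duplicates = newNames.groupBy(identity).collect {
  case (name, occurrences) if occurrences.size > 1 => name
}.toSeq
```

If duplicates is empty, all columns can be renamed in one call with salesDf.toDF(salesDf.columns.map(lastComponent): _*), which avoids the per-column loop.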

Edit 2: Getting the last two components

This is a little more complicated, and there might be a cleaner way to write it, but here is one way that works:

val pattern = (
    ".*?"  +          // Lazy match of leading chars so we ignore the bits we don't want
    "([^.]+\\.)?" +   // Optional second-to-last group, including its trailing dot
    "([^.]+)$"        // Last group
)

val df2 = df.columns.foldLeft(df)(
    (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema
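One possibly cleaner way to keep the last two components is to split on the dot and take the trailing pieces. This is a sketch, not the answer's method; `lastComponents` is a hypothetical helper name.

```scala
// Hypothetical helper: keep the last n dot-separated components of a name.
// Names with fewer than n components are returned unchanged.
def lastComponents(name: String, n: Int): String =
  name.split('.').takeRight(n).mkString(".")

// Column names from the question, plus the dot-free "date".
val inputNames = Seq("date", "sales_data.metric1", "sales_data.type.metric2")
val lastTwo = inputNames.map(lastComponents(_, 2))
```

Note that names shortened this way can still contain a dot (for example type.metric2), so selecting them still requires backticks.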
