Scala: how to substring column names after the last dot?
Question
After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
Selecting one of these columns fails with an error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so that the column names are resolved correctly?
I tried the following. It extracts the substring after the last dot correctly, but I also have columns without dots, such as date, and their names are removed completely.
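var salesDf_new = salesDf
for (col <- salesDf.columns) {
  // substringAfterLast returns "" when the name contains no ".",
  // which is why columns such as date end up with an empty name
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}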
I want to keep only the last component of each name: metric1, metric2, metric3.
Recommended answer
You can use backticks to select columns whose names include periods.
val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
// |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
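The same quoting applies to the column names from the question. Here is a minimal sketch, assuming a hypothetical helper named quoted (it is not part of Spark) and assuming the names contain no backticks themselves:

```scala
// Hypothetical helper (not a Spark API): wrap a column name in backticks
// so that a name containing dots is read as a single identifier.
// Assumes the name itself contains no backtick characters.
def quoted(name: String): String = s"`$name`"

// Column names taken from the question, plus an undotted one.
val questionColumns = Seq("date", "sales_data.metric1", "sales_data.type.metric2")

// Every name is quoted, so dotted and undotted names are handled alike.
val selectable = questionColumns.map(quoted)

// With a DataFrame in scope, the quoted names could be passed to select:
// salesDf.select(selectable.head, selectable.tail: _*)
```

Quoting every name, including the ones without dots, keeps the selection code uniform.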
You can also rename them with a fold: start from the current DataFrame, rename one column per step, and return the final result.
val df2 = df.columns.foldLeft(df)(
(myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
Getting the last component
To rename each column to just the last component of its name, this regex will work. Names without a dot do not match the pattern, so they are left unchanged:
val df2 = df.columns.foldLeft(df)(
(myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
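A regex is not strictly required for this. The sketch below works on plain strings only and uses two hypothetical helpers, lastComponent and collisions, which are illustrative names and not part of Spark. It also checks whether two original columns would end up with the same short name, something the question does not mention but which can happen once prefixes are dropped.

```scala
// Hypothetical helpers for illustration; they operate on plain strings.

// Text after the last '.', or the whole name when there is no '.'.
// lastIndexOf returns -1 when nothing is found, so substring(0) keeps the full name.
def lastComponent(name: String): String =
  name.substring(name.lastIndexOf('.') + 1)

// Short names that more than one original column would be renamed to.
def collisions(columns: Seq[String]): Set[String] =
  columns
    .groupBy(lastComponent)
    .collect { case (short, originals) if originals.size > 1 => short }
    .toSet

// Column names from the question, plus the undotted date column.
val salesColumns = Seq(
  "date",
  "sales_data.metric1",
  "sales_data.type.metric2",
  "sales_data.type3.metric3"
)

// date, metric1, metric2, metric3
val shortNames = salesColumns.map(lastComponent)
```

With a DataFrame in scope, lastComponent could replace the regex inside the fold, for example `myDF.withColumnRenamed(col, lastComponent(col))`. Running collisions on the column list first shows whether the rename would produce duplicate names.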
Getting the last two components
This is a little more complicated, and there might be a cleaner way to write it, but here is a way that works:
val pattern = (
  ".*?" +          // Lazily match leading chars so we ignore the bits we don't want
  "([^.]+\\.)?" +  // Optional second-to-last component, including its trailing dot
  "([^.]+)$"       // Last component
)
val df2 = df.columns.foldLeft(df)(
(myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema
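One possibly cleaner alternative to the regex is to split the name on dots and keep the trailing pieces. This is a sketch on plain strings; lastComponents is a hypothetical helper name, not a Spark function. Note that with two components the resulting names still contain a dot, so they still need backticks when selected.

```scala
// Hypothetical helper: keep the last n dot-separated components of a name.
// A name with fewer than n components is returned unchanged.
def lastComponents(name: String, n: Int): String =
  name.split("\\.").takeRight(n).mkString(".")

val twoDeep = Seq("date", "sales_data.metric1", "sales_data.type.metric2")
  .map(lastComponents(_, 2))
// date, sales_data.metric1, type.metric2
```

Inside the fold this could be used as `myDF.withColumnRenamed(col, lastComponents(col, 2))`.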