How to replace string values in one column with actual column values from other columns in the same dataframe? Part 2


Problem description

I have some string values in one column. I would like to replace the substrings in that column with values from other columns, and replace every plus sign with a space (as shown below).

I have these List[String] mappings, which are passed in dynamically. The entries of mapFrom and mapTo correspond by index.

Description values: mapFrom: [Child, ChildAge, ChildState]

Column names: mapTo: [name, age, state]

Example input:

name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child

Expected output:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

How can I achieve this using Spark Scala?

When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?

the output becomes:

name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler


Recommended answer

The problem is that the description contains Child, which is a prefix of both ChildAge and ChildState. The replacement is done with a regex that matches Child wherever it occurs, so the Child part of the longer tokens is replaced by the name as well. This produces strange outputs such as tiffanyAge and tiffanyState (note that only the Child part has been replaced by the name).
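The clash can be reproduced without Spark. The sketch below is plain Scala, not the Spark job itself: `PrefixClash` and `substitute` are hypothetical names, and `String.replaceAll` stands in for `regexp_replace`, on the assumption that the replacements are applied one after another in list order.

```scala
import java.util.regex.Matcher

object PrefixClash {
  // Apply each (pattern, replacement) pair in order, mimicking a fold over regexp_replace.
  def substitute(description: String, pairs: List[(String, String)]): String =
    pairs.foldLeft(description) { case (acc, (from, to)) =>
      // quoteReplacement keeps '$' and '\' in a value from being read as group references
      acc.replaceAll(from, Matcher.quoteReplacement(to))
    }

  // Values from the first sample row, with Child listed first as in the question
  val pairs: List[(String, String)] = List(
    "Child"      -> "tiffany",
    "ChildAge"   -> "10",
    "ChildState" -> "virginia",
    " \\+ "      -> " "
  )

  val result: String = substitute("Child + ChildAge + ChildState", pairs)
  // result == "tiffany tiffanyAge tiffanyState"
}
```

Because Child is replaced first, no ChildAge or ChildState token is left for the later patterns to find.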

There are two simple solutions in this case that do not require changing the input:


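  1. Change the regex for Child to use a lookahead:

val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "

This matches Child only when it is followed by a space or by the end of the string, so the Child prefix of ChildAge and ChildState is left alone. The end-of-string branch is needed for descriptions that end in Child, such as ChildAge + ChildState + Child.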

  2. Put Child last in the list. This means that ChildAge and ChildState will be matched first:

val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "
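Since the mappings are passed in dynamically, the "longer tokens first" ordering could also be derived rather than written by hand. The following is a minimal sketch in plain Scala under that assumption; `OrderByLength`, `longestFirst` and `substitute` are hypothetical helper names, and `String.replaceAll` again stands in for `regexp_replace`.

```scala
import java.util.regex.Matcher

object OrderByLength {
  // Pair each pattern with its target, then sort so that longer patterns come first.
  // Zipping before sorting keeps mapFrom and mapTo aligned by index.
  def longestFirst(mapFrom: List[String], mapTo: List[String]): List[(String, String)] =
    mapFrom.zip(mapTo).sortBy { case (from, _) => -from.length }

  // Apply the replacements in order on a single description string.
  def substitute(description: String, pairs: List[(String, String)]): String =
    pairs.foldLeft(description) { case (acc, (from, to)) =>
      acc.replaceAll(from, Matcher.quoteReplacement(to))
    }

  val ordered: List[(String, String)] =
    longestFirst(List("Child", "ChildAge", "ChildState"), List("name", "age", "state"))

  // Third sample row: the patterns are mapped to that row's values, plus the " + " rule
  val tylerPairs: List[(String, String)] =
    longestFirst(List("Child", "ChildAge", "ChildState"), List("tyler", "12", "ohio")) :+ (" \\+ " -> " ")
  val tyler: String = substitute("ChildAge + ChildState + Child", tylerPairs)
}
```

Sorting by length only helps when the clash comes from one token being a prefix of another, as it does here.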


Full solution with the first alternative:

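import org.apache.spark.sql.functions.{col, lit, regexp_replace}

// Patterns to search for; the lookahead (space or end of string) keeps Child from matching inside ChildAge or ChildState
val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "
// Replacement columns, aligned by index with mapFrom; " + " becomes a single space
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

// Apply each replacement in turn to the description column
val df2 = mapToFrom.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumn("description", regexp_replace(col("description"), lit(from), to))
}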
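A lookahead pattern of this kind can be checked on the three sample rows without a Spark session. The sketch below is plain Scala and only an approximation of the job above: `LookaheadCheck` and `render` are hypothetical names, `String.replaceAll` stands in for `regexp_replace`, and the pattern used is `Child(?= |$)`, which accepts either a space or the end of the string after Child. The end-of-string branch matters for a row whose description ends in Child.

```scala
import java.util.regex.Matcher

object LookaheadCheck {
  // Patterns in the same order as the first alternative, with " + " handled last
  val patterns: List[String] = List("Child(?= |$)", "ChildAge", "ChildState", " \\+ ")

  // Replace the tokens in one description using that row's name, age and state.
  def render(name: String, age: Int, state: String, description: String): String = {
    val values = List(name, age.toString, state, " ")
    patterns.zip(values).foldLeft(description) { case (acc, (from, to)) =>
      acc.replaceAll(from, Matcher.quoteReplacement(to))
    }
  }

  val rows: List[String] = List(
    render("tiffany", 10, "virginia", "Child + ChildAge + ChildState"),
    render("andrew", 11, "california", "ChildState + Child + ChildAge"),
    render("tyler", 12, "ohio", "ChildAge + ChildState + Child")
  )
}
```

With these inputs, `rows` holds the three descriptions listed under the expected output in the question.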
