How to change the column names of a dataframe based on another dataframe

Problem description

I need to rename the columns of a dataframe df according to another dataframe df_col, using pyspark.

df

+----+---+----+----+
|code| id|name|work|
+----+---+----+----+
| ASD|101|John| DEV|
| klj|102| ben|prod|
+----+---+----+----+

df_col

+-----------+-----------+
|col_current|col_updated|
+-----------+-----------+
|         id|     Row_id|
|       name|       Name|
|       code|   Row_code|
|       Work|  Work_Code|
+-----------+-----------+

If a df column name matches a value in col_current, that column should be renamed to the corresponding col_updated value. For example, if df.id matches df_col.col_current, df.id should be renamed to Row_id.

Expected output

Row_id,Name,Row_code,Work_Code
101,John,ASD,DEV
102,ben,klj,prod

Note: I want this process to be dynamic.

Recommended answer

Just collect df_col as a dictionary:

df = spark.createDataFrame(
    [("ASD", "101", "John", "DEV"), ("klj", "102", "ben", "prod")],
    ("code", "id", "name", "work")
)

df_col = spark.createDataFrame(
    [("id", "Row_id"), ("name", "Name"), ("code", "Row_code"), ("Work", "Work_Code")],
    ("col_current", "col_updated")
)

name_dict = df_col.rdd.collectAsMap()

and use select with a list comprehension:

df.select([df[c].alias(name_dict.get(c, c)) for c in df.columns]).printSchema()
# root
#  |-- Row_code: string (nullable = true)
#  |-- Row_id: string (nullable = true)
#  |-- Name: string (nullable = true)
#  |-- work: string (nullable = true)

where name_dict is a standard Python dictionary:

{'Work': 'Work_Code', 'code': 'Row_code', 'id': 'Row_id', 'name': 'Name'}

name_dict.get(c, c) returns the new name for the current name c, or c itself if there is no match:

name_dict.get("code", "code")
# 'Row_code'

name_dict.get("work", "work")  # Case sensitive 
# 'work'

Finally, alias renames the column (df[c]) to the name returned by name_dict.get.
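
The dictionary lookup above is case sensitive, which is why work in df is not matched by Work in df_col and keeps its old name. If a case-insensitive match is what you want, one possible sketch is to lowercase the keys before looking them up. The helper name resolve_new_names is hypothetical (it is not part of PySpark), and the snippet runs on plain Python lists so it does not need a Spark session:

```python
def resolve_new_names(columns, name_dict):
    """Return the new name for each column, matching current names case-insensitively.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Normalize the mapping keys once so "Work" in df_col matches "work" in df.
    lowered = {current.lower(): updated for current, updated in name_dict.items()}
    # Fall back to the existing name when the mapping has no entry for a column.
    return [lowered.get(c.lower(), c) for c in columns]


name_dict = {"Work": "Work_Code", "code": "Row_code", "id": "Row_id", "name": "Name"}
columns = ["code", "id", "name", "work"]  # same order as df.columns in the question

new_names = resolve_new_names(columns, name_dict)
print(new_names)
# ['Row_code', 'Row_id', 'Name', 'Work_Code']

# With a live SparkSession, the same list could drive the rename, for example:
# df.select([df[c].alias(n) for c, n in zip(df.columns, new_names)])
```

Note that if two keys in df_col differ only by case, the later one wins after lowercasing, so this approach assumes the mapping has no such collisions.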
