pySpark mapping multiple columns

Views: 39 Published: 2021/12/22 21:15:39 dataframe dictionary pyspark pyspark-dataframes

Problem description

I need to be able to compare two dataframes using multiple columns.

pySpark attempt

# collect the distinct PrimaryLookupAttributeValue values from the reference table into a list, to compare them to df1

primaryAttributeValue_List = [p.PrimaryLookupAttributeValue for p in AttributeLookup.select('PrimaryLookupAttributeValue').distinct().collect()]
primaryAttributeValue_List  # list of values, varies by filter

Out: ['Archive',
 'Pending Security Deposit',
 'Partially Abandoned',
 'Revision Contract Review',
 'Open',
 'Draft Accounting In Review',
 'Draft Returned']


# compare df1 to the PrimaryLookupAttributeValue list
import pyspark.sql.functions as f

output = dataset_standardFalse2.withColumn('ConformedLeaseStatusName', f.when(dataset_standardFalse2['LeaseStatus'].isin(primaryAttributeValue_List), "FOUND").otherwise("TBD"))

display(output)
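
To make the behaviour of the attempt concrete, here is a plain-Python analogue of the `when(...isin(...)).otherwise(...)` expression, runnable without Spark. The sample rows and the shortened value set are hypothetical; only the column names come from the question.

```python
# Plain-Python analogue of when(col.isin(values), "FOUND").otherwise("TBD").
# The rows and the reduced value set below are hypothetical sample data.
reference_values = {"Archive", "Open", "Draft Returned"}

rows = [
    {"LeaseStatus": "Archive"},
    {"LeaseStatus": "Terminated"},
    {"LeaseStatus": None},
]


def flag_row(row, allowed):
    """Return a copy of the row with a FOUND/TBD flag for LeaseStatus."""
    flagged = dict(row)
    found = row["LeaseStatus"] in allowed
    flagged["ConformedLeaseStatusName"] = "FOUND" if found else "TBD"
    return flagged


output_rows = [flag_row(r, reference_values) for r in rows]
```

This only reports whether a value of one column exists in the reference list. It does not return the conformed value, and it does not take the attribute name into account, which is the gap the answer below addresses.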

Recommended answer

From my understanding, you can create a map from the columns of reference_df (I am assuming this is not a very big dataframe, since the map entries are collected to the driver):

map_key = concat_ws('', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
map_value = OutputItemNameByValue
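
One thing worth checking with a concatenated key is whether two different (name, value) pairs can produce the same string. The sketch below is plain Python with made-up pairs chosen to collide. The `"\0"` separator and the `make_key` helper are assumptions for illustration, not part of the answer as shown here.

```python
# Hypothetical (attribute name, attribute value) pairs chosen to collide
# when they are joined without a separator.
pairs = [("LeaseStatus", "Open"), ("Lease", "StatusOpen")]


def make_key(name, value, sep=""):
    """Join attribute name and value into a single lookup key."""
    return sep.join([name, value])


# With an empty separator both pairs become 'LeaseStatusOpen'.
keys_no_sep = {make_key(n, v) for n, v in pairs}

# With a separator that cannot occur in the data, the keys stay distinct.
keys_with_sep = {make_key(n, v, sep="\0") for n, v in pairs}
```

If a separator is used, the same one has to be used both when building the map and when building the lookup key from df1, otherwise every lookup misses.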

and then use this mapping to get the corresponding values in df1:

from itertools import chain
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map

d = reference_df.agg(
    collect_set(array(
        concat_ws('', 'PrimaryLookupAttributeName', 'PrimaryLookupAttributeValue'),
        'OutputItemNameByValue',
    )).alias('m')
).first().m
# d is a list of [key, value] pairs, for example:
#[['LeaseStatusAbandoned', 'Active'],
# ['LeaseRecoveryTypeGross-modified', 'Modified Gross'],
# ['LeaseStatusArchive', 'Expired'],
# ['LeaseStatusTerminated', 'Terminated'],
# ['LeaseRecoveryTypeGross w/base year', 'Modified Gross'],
# ['LeaseStatusDraft', 'Pending'],
# ['LeaseRecoveryTypeGross', 'Gross']]

mappings = create_map([lit(i) for i in chain.from_iterable(d)])

primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

df1.select("*", *[ mappings[concat_ws('', lit(c), col(c))].alias("Matched[{}]OutputItemNameByValue".format(c)) for c in primaryLookupAttributeName_List ]).show()
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|SourceSystemName|...|Matched[LeaseType]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|Matched[LeaseStatus]OutputItemNameByValue|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|          ABC123|...|                                   null|                                          Gross|                               Terminated|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Expired|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Pending|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
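
The lookup logic can also be followed in plain Python, without Spark. This is a minimal sketch: the reference pairs are the ones listed in the collected output, while the two df1 rows (including the `LeaseType` values) are hypothetical, and `add_matches` is a helper name made up for this example.

```python
# (PrimaryLookupAttributeName, PrimaryLookupAttributeValue, OutputItemNameByValue)
reference_rows = [
    ("LeaseStatus", "Abandoned", "Active"),
    ("LeaseStatus", "Archive", "Expired"),
    ("LeaseStatus", "Terminated", "Terminated"),
    ("LeaseStatus", "Draft", "Pending"),
    ("LeaseRecoveryType", "Gross-modified", "Modified Gross"),
    ("LeaseRecoveryType", "Gross w/base year", "Modified Gross"),
    ("LeaseRecoveryType", "Gross", "Gross"),
]

# Same idea as create_map: key = attribute name + attribute value.
mapping = {name + value: output for name, value, output in reference_rows}

lookup_columns = ["LeaseType", "LeaseRecoveryType", "LeaseStatus"]

# Hypothetical df1 rows.
df1_rows = [
    {"LeaseType": "Office", "LeaseRecoveryType": "Gross", "LeaseStatus": "Terminated"},
    {"LeaseType": "Retail", "LeaseRecoveryType": "Gross-modified", "LeaseStatus": "Archive"},
]


def add_matches(row):
    """Return a copy of the row with one matched column per lookup column."""
    matched = dict(row)
    for c in lookup_columns:
        value = row.get(c)
        key = None if value is None else c + value
        # A key that is not in the mapping yields None, like null in Spark.
        matched["Matched[{}]OutputItemNameByValue".format(c)] = mapping.get(key)
    return matched


result = [add_matches(r) for r in df1_rows]
```

`LeaseType` has no entries in the reference pairs, so its matched column stays empty for every row, which mirrors the null column in the table above.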

UPDATE: to set the column names from the information retrieved from the reference_df dataframe:

# a list of domains to retrieve
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

# mapping from domain names to column names: using `reference_df`.`TargetAttributeForName`
NEWprimaryLookupAttributeName_List = dict(reference_df.filter(reference_df['DomainName'].isin(primaryLookupAttributeName_List)).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()])

Note-1: it is better to loop through primaryLookupAttributeName_List so that the order of the columns is preserved. If any entry in primaryLookupAttributeName_List is missing from the dictionary, we can set a default column name, i.e. Unknown-<col>. In the old method, columns with missing entries are simply discarded.

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])
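
The difference between the two loops can be seen on the alias names alone. In this plain-Python sketch the dictionary contents are hypothetical (`ConformedLeaseRecoveryTypeName` is an invented target name), and `LeaseType` is deliberately left out of the dictionary.

```python
# Hypothetical domain -> target column name mapping; 'LeaseType' is missing.
target_names = {
    "LeaseStatus": "ConformedLeaseStatusName",
    "LeaseRecoveryType": "ConformedLeaseRecoveryTypeName",
}
domains = ["LeaseType", "LeaseRecoveryType", "LeaseStatus"]

# Old approach: iterate over the dictionary, so a missing domain never appears
# and the order is whatever order the dictionary was built in.
aliases_from_items = [c_name for c, c_name in target_names.items()]

# Note-1 approach: iterate over the list, falling back to a default name.
aliases_from_list = [target_names.get(c, "Unknown-{}".format(c)) for c in domains]
```

Iterating over the list yields one alias per requested domain in the requested order, while iterating over the dictionary silently yields only two.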

Note-2: per the comments, to overwrite the existing column names (untested):

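(1) use select:

# keep every existing column that is not about to be replaced, then append the mapped columns
test = dataset_standardFalse2.select([c for c in dataset_standardFalse2.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings[concat_ws('', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])
# show() returns None, so call it separately instead of assigning its result to test
test.show()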
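
The column bookkeeping behind the select variant can be checked without Spark. In this sketch the existing column list and the dictionary contents are hypothetical; it only computes which column names the select would produce.

```python
# Hypothetical existing columns; 'ConformedLeaseStatusName' already exists
# and is supposed to be overwritten by the mapped value.
existing_columns = ["SourceSystemName", "LeaseStatus", "ConformedLeaseStatusName"]
target_names = {"LeaseStatus": "ConformedLeaseStatusName"}
domains = ["LeaseType", "LeaseStatus"]

# Existing columns that are not about to be replaced.
kept = [c for c in existing_columns if c not in target_names.values()]

# Names of the newly mapped columns, with a default for unknown domains.
added = [target_names.get(c, "Unknown-{}".format(c)) for c in domains]

final_columns = kept + added
```

Dropping the old `ConformedLeaseStatusName` before appending the new one is what keeps the result free of duplicate column names. Note that the overwritten column moves to the end of the column list.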

(2) use reduce (not recommended if the list is very long):

from functools import reduce

# withColumn names the result by its first argument (an alias on the expression is ignored), so the target name goes there
df_new = reduce(lambda d, c: d.withColumn(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c)), mappings[concat_ws('', lit(c), col(c))]), primaryLookupAttributeName_List, dataset_standardFalse2)
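
For readers less familiar with `functools.reduce`, here is a plain-Python analogue of the fold, applied to a single row represented as a dict. The mapping, the target names and the row are hypothetical, and `with_column` is a stand-in written for this sketch that stores the mapped value under the target name, which is assumed to be the intent of the overwrite.

```python
from functools import reduce

# Hypothetical lookup data.
mapping = {"LeaseStatusArchive": "Expired", "LeaseRecoveryTypeGross": "Gross"}
target_names = {"LeaseStatus": "ConformedLeaseStatusName"}
domains = ["LeaseRecoveryType", "LeaseStatus"]

row = {
    "LeaseStatus": "Archive",
    "LeaseRecoveryType": "Gross",
    "ConformedLeaseStatusName": "TBD",
}


def with_column(acc, c):
    """Return a new dict with the mapped value stored under the target name."""
    target = target_names.get(c, "Unknown-{}".format(c))
    updated = dict(acc)
    updated[target] = mapping.get(c + acc[c])
    return updated


# reduce threads the accumulator through one with_column call per domain.
new_row = reduce(with_column, domains, row)
```

Each step takes the result of the previous step as its input. In Spark every such step adds another projection to the query plan, which is one reason the select variant tends to be preferred when the list is long.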

Reference: PySpark create mapping from a dict
