How to compare two DataFrame (StructType) in python


Problem description

Essentially, I want to compare two DataFrames. I am able to compare their column names with:

def diff(first, second):
    second = set(second)
    return [item for item in first if item not in second]

But I want to compare not only the names but also the data types.

Sample DataFrames are shown below:

>>> pDF1.schema
StructType(
List(
StructField(Scen_Id,IntegerType,true),
StructField(Flow_Direction,StringType,true),
StructField(Dataset_Type,StringType,true),
StructField(Flag_Extrapolation_Percent_Change_Stay,IntegerType,true)
)
)

>>> pDF2.schema
StructType(
List(
StructField(Scen_Id,StringType,true),
StructField(Flow_Direction,StringType,true),
StructField(Dataset_Type,StringType,true),
StructField(Flag_Extrapolation_Percent_Change_Stay,IntegerType,true)
)
)

As you can see from this simplified example (in practice our DataFrames often contain over 100 fields), pDF2 has the same names and data types as pDF1, except for the first field, which has a different data type.

Thank you very much.

Solution

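OK, so the answer is indeed very straightforward; here it is for future readers' reference. The same diff function works on the schema fields, since each StructField carries the column name and its data type: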

import pprint

def diff(first, second):
    # Items of `first` that do not appear in `second`, in original order.
    second = set(second)
    return [item for item in first if item not in second]

# Each StructField holds the column name, data type, and nullability.
dl1_fields = list(pDF1.schema.fields)
dl2_fields = list(pDF2.schema.fields)

print("=========================================================")
print("schema comparison result:")
print("=========================================================")
dl1Notdl2 = diff(dl1_fields, dl2_fields)
print(str(len(dl1Notdl2)) + " columns in first df but not in second")
pprint.pprint(dl1Notdl2)
print("=========================================================")
dl2Notdl1 = diff(dl2_fields, dl1_fields)
print(str(len(dl2Notdl1)) + " columns in second df but not in first")
pprint.pprint(dl2Notdl1)
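
The set-based diff reports a field whenever any part of the StructField differs, so a column whose type changed shows up in both lists. As a rough sketch of a complementary check, the function below pairs fields by name and reports only the columns whose data types differ. The `Field` namedtuple and the `type_mismatches` helper are hypothetical names introduced for illustration; `Field` merely stands in for a StructField (using only its `name` and `dataType` attributes) so the example runs without a Spark session, and the types are plain strings rather than real Spark type objects.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.types.StructField; only the
# .name and .dataType attributes are used, and types are plain strings here.
Field = namedtuple("Field", ["name", "dataType"])


def type_mismatches(first_fields, second_fields):
    """Return (name, first_type, second_type) for every column present in
    both schemas whose data types differ, in the order of the first schema."""
    second_types = {f.name: f.dataType for f in second_fields}
    return [
        (f.name, f.dataType, second_types[f.name])
        for f in first_fields
        if f.name in second_types and f.dataType != second_types[f.name]
    ]


# The two schemas from the question, reduced to name and type.
fields1 = [
    Field("Scen_Id", "IntegerType"),
    Field("Flow_Direction", "StringType"),
    Field("Dataset_Type", "StringType"),
    Field("Flag_Extrapolation_Percent_Change_Stay", "IntegerType"),
]
fields2 = [
    Field("Scen_Id", "StringType"),
    Field("Flow_Direction", "StringType"),
    Field("Dataset_Type", "StringType"),
    Field("Flag_Extrapolation_Percent_Change_Stay", "IntegerType"),
]

mismatches = type_mismatches(fields1, fields2)
print(mismatches)  # [('Scen_Id', 'IntegerType', 'StringType')]
```

With real DataFrames, the same function should accept pDF1.schema.fields and pDF2.schema.fields directly, since StructField exposes the same two attributes.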
