PySpark dataframe to_json() function


Problem description


I have a dataframe like the one below:

>>> df.show(10,False)
+-----+----+---+------+
|id   |name|age|salary|
+-----+----+---+------+
|10001|alex|30 |75000 |
|10002|bob |31 |80000 |
|10003|deb |31 |80000 |
|10004|john|33 |85000 |
|10005|sam |30 |75000 |
+-----+----+---+------+

I am converting the entire row of df into one new column, "jsonCol":

>>> newDf1 = df.withColumn("jsonCol", to_json(struct([df[x] for x in df.columns])))
>>> newDf1.show(10,False)
+-----+----+---+------+--------------------------------------------------------+
|id   |name|age|salary|jsonCol                                                 |
+-----+----+---+------+--------------------------------------------------------+
|10001|alex|30 |75000 |{"id":"10001","name":"alex","age":"30","salary":"75000"}|
|10002|bob |31 |80000 |{"id":"10002","name":"bob","age":"31","salary":"80000"} |
|10003|deb |31 |80000 |{"id":"10003","name":"deb","age":"31","salary":"80000"} |
|10004|john|33 |85000 |{"id":"10004","name":"john","age":"33","salary":"85000"}|
|10005|sam |30 |75000 |{"id":"10005","name":"sam","age":"30","salary":"75000"} |
+-----+----+---+------+--------------------------------------------------------+

Instead of converting the entire row into a JSON string as in the step above, I need a solution that includes only some columns, based on the value of each field. I have provided a sample condition in the command below.

But when I started using the when function, the resulting JSON string's column names (keys) were gone. The keys are generated from the columns' positions (col1, col2, ...) instead of the actual column names:

>>> newDf2 = df.withColumn("jsonCol", to_json(struct([ when(col(x)!="  ",df[x]).otherwise(None) for x in df.columns])))
>>> newDf2.show(10,False)
+-----+----+---+------+---------------------------------------------------------+
|id   |name|age|salary|jsonCol                                                  |
+-----+----+---+------+---------------------------------------------------------+
|10001|alex|30 |75000 |{"col1":"10001","col2":"alex","col3":"30","col4":"75000"}|
|10002|bob |31 |80000 |{"col1":"10002","col2":"bob","col3":"31","col4":"80000"} |
|10003|deb |31 |80000 |{"col1":"10003","col2":"deb","col3":"31","col4":"80000"} |
|10004|john|33 |85000 |{"col1":"10004","col2":"john","col3":"33","col4":"85000"}|
|10005|sam |30 |75000 |{"col1":"10005","col2":"sam","col3":"30","col4":"75000"} |
+-----+----+---+------+---------------------------------------------------------+

I need to use the when function but keep the actual column names as keys, as in newDf1. Can someone help me out?

Solution

You have used conditional expressions inside the struct function as columns. Because those expressions carry no name of their own, Spark renames the struct fields col1, col2, and so on. That is why you need alias to restore the original names:

from pyspark.sql import functions as F
newDf2 = df.withColumn("jsonCol", F.to_json(F.struct([F.when(F.col(x)!="  ",df[x]).otherwise(None).alias(x) for x in df.columns])))
newDf2.show(truncate=False)
