PySpark dataframe to_json() function
I have a dataframe like below,
>>> df.show(10,False)
+-----+----+---+------+
|id |name|age|salary|
+-----+----+---+------+
|10001|alex|30 |75000 |
|10002|bob |31 |80000 |
|10003|deb |31 |80000 |
|10004|john|33 |85000 |
|10005|sam |30 |75000 |
+-----+----+---+------+
Converting the entire row of df into one new column "jsonCol",
>>> newDf1 = df.withColumn("jsonCol", to_json(struct([df[x] for x in df.columns])))
>>> newDf1.show(10,False)
+-----+----+---+------+--------------------------------------------------------+
|id |name|age|salary|jsonCol |
+-----+----+---+------+--------------------------------------------------------+
|10001|alex|30 |75000 |{"id":"10001","name":"alex","age":"30","salary":"75000"}|
|10002|bob |31 |80000 |{"id":"10002","name":"bob","age":"31","salary":"80000"} |
|10003|deb |31 |80000 |{"id":"10003","name":"deb","age":"31","salary":"80000"} |
|10004|john|33 |85000 |{"id":"10004","name":"john","age":"33","salary":"85000"}|
|10005|sam |30 |75000 |{"id":"10005","name":"sam","age":"30","salary":"75000"} |
+-----+----+---+------+--------------------------------------------------------+
Instead of converting the entire row into a JSON string as in the step above, I need to select only a few columns based on the value of each field. I have provided a sample condition in the command below. But when I start using the when function, the resulting JSON string's column names (keys) are gone: the keys only reflect the columns' positions (col1, col2, ...) instead of the actual column names.
>>> newDf2 = df.withColumn("jsonCol", to_json(struct([ when(col(x)!=" ",df[x]).otherwise(None) for x in df.columns])))
>>> newDf2.show(10,False)
+-----+----+---+------+---------------------------------------------------------+
|id |name|age|salary|jsonCol |
+-----+----+---+------+---------------------------------------------------------+
|10001|alex|30 |75000 |{"col1":"10001","col2":"alex","col3":"30","col4":"75000"}|
|10002|bob |31 |80000 |{"col1":"10002","col2":"bob","col3":"31","col4":"80000"} |
|10003|deb |31 |80000 |{"col1":"10003","col2":"deb","col3":"31","col4":"80000"} |
|10004|john|33 |85000 |{"col1":"10004","col2":"john","col3":"33","col4":"85000"}|
|10005|sam |30 |75000 |{"col1":"10005","col2":"sam","col3":"30","col4":"75000"} |
+-----+----+---+------+---------------------------------------------------------+
I need to use the when function but still get the result as in newDf1, with the actual column names as keys. Can someone help me out?
You have used the conditions inside the struct function as columns, and Spark renames such condition columns to col1, col2, and so on. That's why you need alias to restore the original names:
from pyspark.sql import functions as F

newDf2 = df.withColumn("jsonCol", F.to_json(F.struct([
    F.when(F.col(x) != " ", df[x]).otherwise(None).alias(x)
    for x in df.columns
])))
newDf2.show(truncate=False)