PySpark - Retain null values when using collect_list


Problem description

According to the accepted answer to "pyspark collect_set or collect_list with groupby", when you call collect_list on a column, the null values in that column are removed. I have checked, and this is true.

But in my case, I need to keep the null values. How can I achieve this?

I did not find any information on a variant of the collect_list function that does this.

Background on why I want to keep the nulls:

I have a dataframe df as below:

cId   |  eId  |  amount  |  city
1     |  2    |   20.0   |  Paris
1     |  2    |   30.0   |  Seoul
1     |  3    |   10.0   |  Phoenix
1     |  3    |   5.0    |  null

I want to write this to an Elasticsearch index with the following mapping:

"mappings": {
    "doc": {
        "properties": {
            "eId": { "type": "keyword" },
            "cId": { "type": "keyword" },
            "transactions": {
                "type": "nested", 
                "properties": {
                    "amount": { "type": "keyword" },
                    "city": { "type": "keyword" }
                }
            }
        }
    }
}

To conform to the nested mapping above, I transformed my df so that each combination of eId and cId has an array of transactions, like this:

from pyspark.sql.functions import collect_list, struct

df_nested = df.groupBy('eId','cId').agg(collect_list(struct('amount','city')).alias("transactions"))
df_nested.printSchema()
root
 |-- cId: integer (nullable = true)
 |-- eId: integer (nullable = true)
 |-- transactions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: float (nullable = true)
 |    |    |-- city: string (nullable = true)

When I save df_nested as a JSON file, these are the JSON records that I get:

{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}

As you can see, when cId=1 and eId=3, the array element with amount=5.0 has no city attribute, because city was null in my original data (df). The nulls are being removed when I use the collect_list function.

However, when I try to write df_nested to Elasticsearch with the above index, it fails because of a schema mismatch. This is why I want to retain the nulls after applying the collect_list function.

Recommended answer

This should give you what you need:

from pyspark.sql.functions import create_map, collect_list, lit, col, to_json

df = spark.createDataFrame([[1, 2, 20.0, "Paris"], [1, 2, 30.0, "Seoul"], 
    [1, 3, 10.0, "Phoenix"], [1, 3, 5.0, None]], 
    ["cId", "eId", "amount", "city"])

df_nested = df.withColumn(
        "transactions", 
         create_map(lit("city"), col("city"), lit("amount"), col("amount")))\
    .groupBy("eId","cId")\
    .agg(collect_list("transactions").alias("transactions"))

That gives me:

+---+---+------------------------------------------------------------------+
|eId|cId|transactions                                                      |
+---+---+------------------------------------------------------------------+
|2  |1  |[[city -> Paris, amount -> 20.0], [city -> Seoul, amount -> 30.0]]|
|3  |1  |[[city -> Phoenix, amount -> 10.0], [city ->, amount -> 5.0]]     |
+---+---+------------------------------------------------------------------+

The JSON for your column of interest is then what you want:

>>> for row in df_nested.select(to_json("transactions").alias("json")).collect():
...     print(row["json"])

[{"city":"Paris","amount":"20.0"},{"city":"Seoul","amount":"30.0"}]
[{"city":"Phoenix","amount":"10.0"},{"city":null,"amount":"5.0"}]
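
To make the target shape concrete without Spark, here is a minimal plain-Python sketch that groups the four sample rows by (eId, cId) and serializes them with json.dumps, which writes None as null rather than dropping the key. It only illustrates the expected documents and is not a replacement for the Spark job; the helper name nest_transactions is hypothetical and not part of PySpark.

```python
import json
from collections import defaultdict

# Sample rows from the question: (cId, eId, amount, city)
rows = [
    (1, 2, 20.0, "Paris"),
    (1, 2, 30.0, "Seoul"),
    (1, 3, 10.0, "Phoenix"),
    (1, 3, 5.0, None),
]


def nest_transactions(rows):
    """Group rows by (eId, cId), keeping a null city as an explicit None."""
    groups = defaultdict(list)
    for c_id, e_id, amount, city in rows:
        groups[(e_id, c_id)].append({"amount": amount, "city": city})
    return [
        {"eId": e_id, "cId": c_id, "transactions": transactions}
        for (e_id, c_id), transactions in groups.items()
    ]


docs = nest_transactions(rows)
for doc in docs:
    # json.dumps serializes None as null instead of omitting the key
    print(json.dumps(doc))
```

In this sketch amount stays numeric, whereas in the map-based output above it appears as a string ("20.0"), which may be worth comparing against the field types in your index mapping.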
