How do I flatten a pySpark dataframe by one array column?
Question
I have a spark dataframe like this:
+------+--------+--------------+--------------------+
|   dbn|    boro|total_students|                sBus|
+------+--------+--------------+--------------------+
|17K548|Brooklyn|           399|[B41, B43, B44-SB...|
|09X543|   Bronx|           378|[Bx13, Bx15, Bx17...|
|09X327|   Bronx|           543|[Bx1, Bx11, Bx13,...|
+------+--------+--------------+--------------------+
How do I flatten it so that each row is duplicated for each element in sBus, with sBus becoming a plain string column?
The result would look like this:
+------+--------+--------------+--------------------+
|   dbn|    boro|total_students|                sBus|
+------+--------+--------------+--------------------+
|17K548|Brooklyn|           399|                 B41|
|17K548|Brooklyn|           399|                 B43|
|17K548|Brooklyn|           399|              B44-SB|
+------+--------+--------------+--------------------+
and so on...
Answer
I can't think of a way to do this without turning it into an RDD.
# convert df to rdd
rdd = df.rdd

def extract(row, key):
    """Takes a Row and a key, returns tuple of (dict w/o key, dict[key])."""
    _dict = row.asDict()
    _list = _dict[key]
    del _dict[key]
    return (_dict, _list)

def add_to_dict(_dict, key, value):
    """Puts value back into the dict under key and returns the dict."""
    _dict[key] = value
    return _dict

# preserve rest of values in key, put list to flatten in value
rdd = rdd.map(lambda x: extract(x, 'sBus'))

# make a row for each item in value
rdd = rdd.flatMapValues(lambda x: x)

# add flattened value back into dictionary
rdd = rdd.map(lambda x: add_to_dict(x[0], 'sBus', x[1]))

# convert back to dataframe
df = sqlContext.createDataFrame(rdd)
df.show()
The tricky part is keeping the other columns together with the newly flattened values. I do this by mapping each row to a tuple of (dict of other columns, list to flatten) and then calling flatMapValues. This will split each element of the value list into a separate row, but keep the key attached, i.e.
(key, ['A', 'B', 'C'])
becomes
(key, 'A')
(key, 'B')
(key, 'C')
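To illustrate what flatMapValues does, here is a plain-Python sketch of the same semantics (no Spark required; the helper name is mine, not part of pySpark):

```python
def flat_map_values(pairs):
    """Mimic RDD.flatMapValues: for each (key, iterable) pair,
    emit one (key, element) pair per element of the iterable."""
    return [(key, v) for key, values in pairs for v in values]

print(flat_map_values([("key", ["A", "B", "C"])]))
# → [('key', 'A'), ('key', 'B'), ('key', 'C')]
```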
Then, I move the flattened value back into the dictionary of other columns, and convert it back to a DataFrame.
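As an aside, if your Spark version has pyspark.sql.functions.explode (Spark 1.4+), the same flattening can be done without dropping to RDDs. A minimal sketch, assuming df is the dataframe above (requires a running Spark session, so it is not runnable standalone):

```python
from pyspark.sql.functions import explode

# Replace the array column with one row per array element;
# the other columns (dbn, boro, total_students) are duplicated automatically.
df_flat = df.withColumn("sBus", explode(df["sBus"]))
df_flat.show()
```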