Building a row from a dict in pySpark


Question

I'm trying to dynamically build a row in pySpark 1.6.1, then build it into a dataframe. The general idea is to extend the results of describe to include, for example, skew and kurtosis. Here's what I thought should work:

from pyspark.sql import Row

row_dict = {'C0': -1.1990072635132698,
            'C3': 0.12605772684660232,
            'C4': 0.5760856026559944,
            'C5': 0.1951877800894315,
            'C6': 24.72378589441825,
            'summary': 'kurtosis'}

new_row = Row(row_dict)

But this returns TypeError: sequence item 0: expected string, dict found which is a fairly clear error. Then I found that if I defined the Row fields first, I could use a dict:

r = Row('summary', 'C0', 'C3', 'C4', 'C5', 'C6')
r(row_dict)
> Row(summary={'summary': 'kurtosis', 'C3': 0.12605772684660232, 'C0': -1.1990072635132698, 'C6': 24.72378589441825, 'C5': 0.1951877800894315, 'C4': 0.5760856026559944})

Which would be a fine step, except it doesn't seem like I can dynamically specify the fields in Row. I need this to work for an unknown number of rows with unknown names. According to the documentation you can actually go the other way:

>>> Row(name="Alice", age=11).asDict() == {'name': 'Alice', 'age': 11}
True

So it seems like I should be able to do this. It also appears there may be some deprecated features from older versions that allowed this, for example here. Is there a more current equivalent I'm missing?

Answer

You can use keyword arguments unpacking as follows:

Row(**row_dict)

## Row(C0=-1.1990072635132698, C3=0.12605772684660232, C4=0.5760856026559944, 
##     C5=0.1951877800894315, C6=24.72378589441825, summary='kurtosis')
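The keyword-argument unpacking works for field names that are only known at runtime, which is what the question needs. The same mechanics can be sketched in plain Python, with `collections.namedtuple` standing in for `Row` since no live Spark session is assumed here:

```python
from collections import namedtuple

row_dict = {'C0': -1.1990072635132698,
            'C3': 0.12605772684660232,
            'summary': 'kurtosis'}

# Build a constructor from whatever keys the dict happens to have,
# then unpack the dict into it as keyword arguments.
RowLike = namedtuple('RowLike', sorted(row_dict))
record = RowLike(**row_dict)

print(record.summary)  # kurtosis
```

The field names are never written out by hand, so the pattern scales to an unknown number of columns with unknown names.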

Note that it internally sorts the data by key to address problems with older Python versions.

This behavior is likely to be removed in upcoming releases - see SPARK-29748 Remove sorting of fields in PySpark SQL Row creation. Once it is removed, you'll have to ensure that the order of values in the dict is consistent across records.
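One way to guarantee a consistent value order once the implicit sorting is gone is to fix a canonical key order yourself before building rows. A plain-Python sketch of that idea (the field names are illustrative, taken from the question's dict):

```python
records = [
    {'summary': 'kurtosis', 'C0': -1.199, 'C3': 0.126},
    {'C3': 0.5, 'summary': 'skew', 'C0': 2.0},
]

# Pick one canonical key order and pull values in that order for every
# record, regardless of each dict's own insertion order.
keys = sorted(records[0])
rows = [tuple(rec[k] for k in keys) for rec in records]

print(keys)     # ['C0', 'C3', 'summary']
print(rows[0])  # (-1.199, 0.126, 'kurtosis')
```

Extracting values through a single shared key list makes every record line up column-for-column, which is the property the sorting used to provide for free.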

