PySpark Row objects: accessing row elements by variable names


Problem Description


One can access PySpark Row elements using the dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age, respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.asDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function on each row of that data frame. We can certainly do something like

def f(row, element1, element2):
    # Convert the Row to a dict once, then look up the requested fields by name.
    row = row.asDict()
    return ", ".join([str(row[element1]), str(row[element2])])

result = dataframe.map(lambda row: f(row, 'age', 'name'))


However, it seems that calling asDict() on every row will be very inefficient. Is there a better way?

Recommended Answer


As always in Python, if something works there is no magic involved. When something works, like the dot syntax here, it means there is a predictable chain of events. In particular, you can expect that the __getattr__ method will be called:

from pyspark.sql import Row

a_row = Row(foo=1, bar=True)

a_row.__getattr__("foo")
## 1
a_row.__getattr__("bar")
## True


Row also overrides __getitem__ to have the same behavior:

a_row.__getitem__("foo")
## 1


This means you can use bracket notation:

a_row["bar"]
## True
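
Since bracket notation accepts any string, the field name can come straight from a variable, which is what the question asks for. A minimal sketch, reusing a_row from above (the variable name element is only illustrative):

element = "foo"
a_row[element]
## 1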


The problem is that it is not efficient. Each call is O(N), so a single conversion to a dict can be more efficient if you have wide rows and make multiple calls.
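
To illustrate the point about wide rows: the conversion cost is paid once, and every lookup afterwards is a plain dict lookup. A minimal sketch, again reusing a_row from above:

d = a_row.asDict()      # one O(N) conversion of the Row
d["foo"], d["bar"]      # subsequent lookups are cheap O(1) dict lookups
## (1, True)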


In general you should avoid calls like this:

  • Using a UDF will be just as inefficient, but is much cleaner in general.
  • Using built-in SQL expressions should be preferred over map (see the sketch after this list).
  • You shouldn't map directly over a DataFrame; it's going to be deprecated soon.
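
For illustration, here is a minimal sketch of the last two recommendations, assuming a dataframe with name and age columns as in the question (the alias and the lambda are only illustrative):

from pyspark.sql.functions import col, concat_ws, udf
from pyspark.sql.types import StringType

# Built-in SQL expression -- usually the most efficient option:
result = dataframe.select(
    concat_ws(", ", col("age").cast("string"), col("name")).alias("joined")
)

# UDF -- as inefficient as map, but cleaner to read:
join_cols = udf(lambda age, name: "{}, {}".format(age, name), StringType())
result = dataframe.select(join_cols(col("age"), col("name")).alias("joined"))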
