PySpark Row objects: accessing row elements by variable names
Question
One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.asDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function over each row of that data frame. We can certainly do something like
def f(row, element1, element2):
    row = row.asDict()
    # join expects a single iterable, not two separate arguments
    return ", ".join([str(row[element1]), str(row[element2])])

result = dataframe.map(lambda row: f(row, 'age', 'name'))
However, it seems that calling asDict() on every row will be very inefficient. Is there a better way?
Answer
As always in Python, if something works there is no magic. When something works, like the dot syntax here, it means a predictable chain of events. In particular, you can expect that the __getattr__ method will be called:
from pyspark.sql import Row

a_row = Row(foo=1, bar=True)

a_row.__getattr__("foo")
## 1
a_row.__getattr__("bar")
## True
Row also overrides __getitem__ to provide the same behavior:
a_row.__getitem__("foo")
## 1
This means you can use bracket notation:
a_row["bar"]
## True
The problem is that these calls are not efficient. Each call is O(N), so if you have wide rows and make multiple lookups, a single conversion to a dict can be more efficient.
In general you should avoid calls like this:
- using a UDF is just as inefficient, but much cleaner in general
- using built-in SQL expressions should be preferred over map
- you shouldn't map directly over a DataFrame; it is going to be deprecated soon