PySpark Row objects: accessing row elements by variable names
Question
One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.asDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function over each row of that data frame. We can certainly do something like
def f(row, element1, element2):
    row = row.asDict()
    # join expects a single iterable, not two separate arguments
    return ", ".join([str(row[element1]), str(row[element2])])

result = dataframe.map(lambda row: f(row, 'age', 'name'))
However, it seems that calling asDict() on every row will be very inefficient. Is there a better way?
Answer
As always in Python, if something works there is no magic. When something works, like the dot syntax here, it means a predictable chain of events. In particular, you can expect that the __getattr__ method will be called:
from pyspark.sql import Row

a_row = Row(foo=1, bar=True)

a_row.__getattr__("foo")
## 1
a_row.__getattr__("bar")
## True
Row also overrides __getitem__ to provide the same behavior:
a_row.__getitem__("foo")
## 1
This means you can use bracket notation:
a_row["bar"]
## True
The problem is that these calls are not efficient. Each call is O(N), so if you have wide rows and make multiple lookups, a single conversion to a dict can be more efficient.
In general you should avoid calls like this:
- using a UDF is just as inefficient, but much cleaner in general
- using built-in SQL expressions should be preferred over map
- you shouldn't map directly over a DataFrame; it is going to be deprecated soon