PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?


Question

I have a concept I hope you can help to clarify:

What's the difference between the following three ways of referring to a column in a PySpark DataFrame? I know different situations need different forms, but I'm not sure why.

  1. df.col: e.g. F.count(df.col)
  2. df['col']: e.g. df['col'] == 0
  3. F.col('col'): e.g. df.filter(F.col('col').isNull())

Thank you very much!

Answer

In most practical applications, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same.

We can illustrate with a small example:

df = spark.createDataFrame(
    [(1,'a', 0), (2,'b',None), (None,'c',3)], 
    ['col', '2col', 'third col']
)

df.show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#|   1|   a|        0|
#|   2|   b|     null|
#|null|   c|        3|
#+----+----+---------+


1. df.col

This is the least flexible. You can only reference columns that are valid to be accessed using the . operator. This rules out column names containing spaces or special characters, and column names that start with an integer.

This syntax makes a call to df.__getattr__("col").

print(df.__getattr__.__doc__)
#Returns the :class:`Column` denoted by ``name``.
#
#        >>> df.select(df.age).collect()
#        [Row(age=2), Row(age=5)]
#
#        .. versionadded:: 1.3

Using the . syntax, you can only access the first column of this example dataframe.

>>> df.2col
  File "<ipython-input-39-8e82c2dd5b7c>", line 1
    df.2col
       ^
SyntaxError: invalid syntax

Under the hood, it checks to see if the column name is contained in df.columns and then returns the pyspark.sql.Column specified.
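
As a quick check against the example df above (a minimal sketch; the exact Column repr and error text can vary between Spark versions), an identifier-safe name that exists in df.columns comes back as a Column, while a name not in df.columns raises an AttributeError rather than a SyntaxError:

df.col
#Column<col>

df.missing
#AttributeError: 'DataFrame' object has no attribute 'missing'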

2. df['col']

This makes a call to df.__getitem__. You have some more flexibility in that you can do everything that __getattr__ can do, plus you can specify any column name.

df["2col"]
#Column<2col> 

Once again, under the hood some conditionals are checked and in this case the pyspark.sql.Column specified by the input string is returned.
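
Unlike the dot syntax, bracket indexing also handles names that are not valid Python identifiers. For instance (a small sketch on the example df above; the repr may differ slightly by version):

df['third col']
#Column<third col>

df['2col']
#Column<2col>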

In addition, you can pass in multiple columns (as a list or tuple) or column expressions.

from pyspark.sql.functions import expr
df[['col', expr('`third col` IS NULL')]].show()
#+----+-------------------+
#| col|(third col IS NULL)|
#+----+-------------------+
#|   1|              false|
#|   2|               true|
#|null|              false|
#+----+-------------------+

Note that in the case of multiple columns, __getitem__ is just making a call to pyspark.sql.DataFrame.select.
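
In other words, a list argument behaves like a select. As a quick sketch on the example df above, both of these should produce the same output:

df[['col', '2col']].show()
df.select('col', '2col').show()
#+----+----+
#| col|2col|
#+----+----+
#|   1|   a|
#|   2|   b|
#|null|   c|
#+----+----+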

Finally, you can also access columns by index:

df[2]
#Column<third col>

3. pyspark.sql.functions.col

This is the Spark-native way of selecting a column and returns an expression (this is the case for all column functions) which selects the column based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.

For example, suppose we wanted to make a new column that would take on either the value from "col" or the value from "third col", based on the value of "2col":

from pyspark.sql.functions import when

df.withColumn(
    'new',
    when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
).show()
#+----+----+---------+---------+
#| col|2col|third col|      new|
#+----+----+---------+---------+
#|   1|   a|        0|third col|
#|   2|   b|     null|      col|
#|null|   c|        3|third col|
#+----+----+---------+---------+

Oops, that's not what I meant. Spark thought I wanted the literal strings "col" and "third col". Instead, what I should have written is:

from pyspark.sql.functions import col
df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
).show()
#+----+----+---------+---+
#| col|2col|third col|new|
#+----+----+---------+---+
#|   1|   a|        0|  0|
#|   2|   b|     null|  2|
#|null|   c|        3|  3|
#+----+----+---------+---+

Because col() creates the column expression without checking, there are two interesting side effects of this:

  1. It can be re-used, as it's not specific to any one DataFrame
  2. It can be used before the DataFrame is even assigned

age = col('dob') / 365
if_expr = when(age < 18, 'underage').otherwise('adult')

# 'path' is a placeholder for wherever the data actually lives
df1 = spark.read.csv(path).withColumn('age_category', if_expr)

df2 = spark.read.parquet(path)\
    .select('*', age.alias('age'), if_expr.alias('age_category'))

Both age and if_expr generate a Column, even though no DataFrame is referenced when they are defined.
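
As a minimal sketch of the same re-use point on the example df above (the alias names here are just for illustration):

third = col('third col')             # one expression object...
df.select(third.alias('t')).show()   # ...used in a projection
df.filter(third.isNull()).show()     # ...and reused again in a filter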

