PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?


Question

I have a concept I hope you can help to clarify:

What's the difference between the following three ways of referring to a column in a PySpark DataFrame? I know different situations need different forms, but I am not sure why.


  1. df.col: e.g. F.count(df.col)
  2. df['col']: e.g. df['col'] == 0
  3. F.col('col'): e.g. df.filter(F.col('col').isNull())

Many thanks!

Answer

In most practical applications, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same.

We can illustrate with a small example:

df = spark.createDataFrame(
    [(1,'a', 0), (2,'b',None), (None,'c',3)], 
    ['col', '2col', 'third col']
)

df.show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#|   1|   a|        0|
#|   2|   b|     null|
#|null|   c|        3|
#+----+----+---------+




1. df.col

This is the least flexible. You can only reference columns that are valid to be accessed using the . operator. This rules out column names containing spaces or special characters and column names that start with an integer.

This syntax makes a call to df.__getattr__("col").

print(df.__getattr__.__doc__)
#Returns the :class:`Column` denoted by ``name``.
#
#        >>> df.select(df.age).collect()
#        [Row(age=2), Row(age=5)]
#
#        .. versionadded:: 1.3

Using the . syntax, you can only access the first column of this example DataFrame:

>>> df.2col
  File "<ipython-input-39-8e82c2dd5b7c>", line 1
    df.2col
       ^
SyntaxError: invalid syntax
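
For the first column, whose name is a valid Python identifier, the dot syntax works as expected. A quick check against the example DataFrame above (the count result assumes the data shown earlier; the exact Column repr formatting varies between Spark versions):

df.col
#Column<col>

from pyspark.sql import functions as F
df.select(F.count(df.col)).show()  # counts the non-null values in "col"
#+----------+
#|count(col)|
#+----------+
#|         2|
#+----------+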

Under the hood, it checks to see if the column name is contained in df.columns and then returns the pyspark.sql.Column specified.
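
You can see that check in action by asking for a name that is not in df.columns (the exact error message may differ between Spark versions):

df.not_a_col
#AttributeError: 'DataFrame' object has no attribute 'not_a_col'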

2. df["col"]

This makes a call to df.__getitem__. You have some more flexibility in that you can do everything __getattr__ can do, plus you can specify any column name.

df["2col"]
#Column<2col> 
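
Column names that the dot syntax cannot handle, such as the one containing a space, also work:

df['third col']
#Column<third col>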

Once again, under the hood some conditionals are checked, and in this case the pyspark.sql.Column specified by the input string is returned.
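
One of those conditionals is worth knowing about: if you pass in a boolean Column rather than a string, df[...] acts as a filter instead (the output below assumes the example data above):

df[df['col'].isNotNull()].show()
#+---+----+---------+
#|col|2col|third col|
#+---+----+---------+
#|  1|   a|        0|
#|  2|   b|     null|
#+---+----+---------+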

In addition, you can also pass in multiple columns (as a list or tuple) or column expressions.

from pyspark.sql.functions import expr
df[['col', expr('`third col` IS NULL')]].show()
#+----+-------------------+
#| col|(third col IS NULL)|
#+----+-------------------+
#|   1|              false|
#|   2|               true|
#|null|              false|
#+----+-------------------+

Note that in the case of multiple columns, __getitem__ is just making a call to pyspark.sql.DataFrame.select.
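
In other words, the list form should behave the same as the equivalent select (a quick equivalence check on the example DataFrame):

df[['col', '2col']].show()
# same result as: df.select('col', '2col').show()
#+----+----+
#| col|2col|
#+----+----+
#|   1|   a|
#|   2|   b|
#|null|   c|
#+----+----+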

Finally, you can also access columns by index:

df[2]
#Column<third col>
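
Index access can be combined freely with the other forms, for instance inside a select (a small usage sketch on the same DataFrame):

df.select(df[0], df[2]).show()
#+----+---------+
#| col|third col|
#+----+---------+
#|   1|        0|
#|   2|     null|
#|null|        3|
#+----+---------+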


3. pyspark.sql.functions.col

This is the Spark-native way of selecting a column. It returns an expression (this is the case for all column functions) which selects the column based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.

For example, suppose we wanted to make a new column that would take on either the value from "col" or "third col", based on the value of "2col":

from pyspark.sql.functions import when

df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
).show()
#+----+----+---------+---------+
#| col|2col|third col|      new|
#+----+----+---------+---------+
#|   1|   a|        0|third col|
#|   2|   b|     null|      col|
#|null|   c|        3|third col|
#+----+----+---------+---------+

Oops, that's not what I meant. Spark thought I wanted the literal strings "col" and "third col". Instead, what I should have written is:

from pyspark.sql.functions import col
df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
).show()
#+----+----+---------+---+
#| col|2col|third col|new|
#+----+----+---------+---+
#|   1|   a|        0|  0|
#|   2|   b|     null|  2|
#|null|   c|        3|  3|
#+----+----+---------+---+

Because col() creates the column expression without checking, there are two interesting side effects of this:


  1. It can be re-used, as it's not specific to any DataFrame

  2. It can be used before the DataFrame is assigned


# both expressions are built before any DataFrame exists
age = col('dob') / 365
if_expr = when(age < 18, 'underage').otherwise('adult')

df1 = spark.read.csv(path).withColumn('age_category', if_expr)

df2 = spark.read.parquet(path)\
    .select('*', age.alias('age'), if_expr.alias('age_category'))

age generates Column<b'(dob / 365)'>

if_expr generates Column<b'CASE WHEN ((dob / 365) < 18) THEN underage ELSE adult END'>
