"'DataFrame' object has no attribute 'apply'" when trying to apply lambda to create new column
Question
I aim at adding a new column in a Pandas DataFrame, but I am facing a weird error.
The new column is expected to be a transformation of an existing column, which can be done by a lookup in a dictionary/hashmap.
# Loading data
df = sqlContext.read.format(...).load(train_df_path)
# Instantiating the map
some_map = {
'a': 0,
'b': 1,
'c': 1,
}
# Creating a new column using the map
df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)
This results in the following error:
AttributeErrorTraceback (most recent call last)
<ipython-input-12-aeee412b10bf> in <module>()
25 df= train_df
26
---> 27 df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)
/usr/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
962 if name not in self.columns:
963 raise AttributeError(
--> 964 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
965 jc = self._jdf.apply(name)
966 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'apply'
Other potentially useful info: I am using Spark and Python 2.
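For context, the traceback shows why the message complains about a missing *attribute* rather than a missing method: pyspark's DataFrame implements `__getattr__` so that any unknown attribute is looked up as a column name. A minimal pure-Python sketch of that mechanism (the `FakeDataFrame` class below is hypothetical, not pyspark's actual implementation):

```python
class FakeDataFrame:
    """Sketch of how pyspark resolves unknown attributes to columns."""
    def __init__(self, columns):
        self.columns = columns

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so any name
        # that isn't a real attribute is assumed to be a column name.
        if name not in self.columns:
            raise AttributeError(
                "'%s' object has no attribute '%s'"
                % (self.__class__.__name__, name))
        return "Column<%s>" % name

df = FakeDataFrame(["some_column_name"])
print(df.some_column_name)  # resolves to a column
try:
    df.apply  # not a column, so it raises, just like the traceback above
except AttributeError as e:
    print(e)
```

Since `apply` is neither a method of pyspark's DataFrame nor a column of this one, the lookup falls through to the `AttributeError` branch.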
Answer
The syntax you are using is for a pandas DataFrame. To achieve this for a Spark DataFrame, you should use the withColumn() method. This works great for a wide range of well-defined DataFrame functions, but it's a little more complicated for user-defined mapping functions.
In order to define a udf, you need to specify the output data type. For instance, if you wanted to apply a function my_func that returned a string, you could create a udf as follows:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

my_udf = f.udf(my_func, StringType())
Then you can use my_udf to create a new column, for example:
df = df.withColumn('new_column', my_udf(f.col("some_column_name")))
Another option is to use select:
df = df.select("*", my_udf(f.col("some_column_name")).alias("new_column"))
Your specific question
Using a udf
In your specific case, you want to use a dictionary to translate the values of your DataFrame.
Here is a way to define a udf for this purpose:
from pyspark.sql.types import IntegerType

some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())
Notice that I used dict.get() because you want your udf to be robust to bad inputs.
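A quick plain-Python illustration of why dict.get() matters here:

```python
some_map = {'a': 0, 'b': 1, 'c': 1}

# dict.get() returns a default instead of raising on unseen keys,
# which is what makes the udf robust to values missing from the map.
print(some_map.get('a', None))  # 0
print(some_map.get('z', None))  # None -> becomes null in the Spark column

# By contrast, some_map['z'] would raise KeyError inside the udf
# and fail the whole Spark task.
```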
df = df.withColumn('new_column', some_map_udf(f.col("some_column_name")))
Using DataFrame functions
Sometimes using a udf is unavoidable, but whenever possible, using DataFrame functions is usually preferred.
Here is one option to do the same thing without using a udf.
The trick is to iterate over the items in some_map to create a list of pyspark.sql.functions.when() functions.
some_map_func = [f.when(f.col("some_column_name") == k, v) for k, v in some_map.items()]
print(some_map_func)
#[Column<CASE WHEN (some_column_name = a) THEN 0 END>,
# Column<CASE WHEN (some_column_name = c) THEN 1 END>,
# Column<CASE WHEN (some_column_name = b) THEN 1 END>]
Now you can use pyspark.sql.functions.coalesce() inside of a select:
df = df.select("*", f.coalesce(*some_map_func).alias("new_column"))
This works because when() returns null by default if the condition is not met, and coalesce() will pick the first non-null value it encounters. Since the keys of the map are unique, at most one column will be non-null.
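To see why coalescing over the when() columns reproduces the dictionary lookup, here is a plain-Python emulation of the per-row semantics (the helper functions below only mimic, and are not, the pyspark functions of the same name):

```python
some_map = {'a': 0, 'b': 1, 'c': 1}

def when(condition, value):
    # Mimics f.when(): the value if the condition holds, otherwise None (null).
    return value if condition else None

def coalesce(*values):
    # Mimics f.coalesce(): the first non-null value, or None if all are null.
    return next((v for v in values if v is not None), None)

def translate(cell):
    # One CASE WHEN per map entry, then coalesce across them,
    # just like the select above does column-wise.
    cases = [when(cell == k, v) for k, v in some_map.items()]
    return coalesce(*cases)

print([translate(x) for x in ['a', 'b', 'c', 'z']])  # [0, 1, 1, None]
```

Note that an unmapped value like 'z' ends up as None, matching the dict.get() behaviour of the udf approach.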