"'DataFrame' object has no attribute 'apply'" when trying to apply lambda to create new column
Problem description
I aim at adding a new column in a Pandas DataFrame, but I am facing a weird error.
The new column is expected to be a transformation of an existing column, which can be done with a lookup in a dictionary/hashmap.
# Loading data
df = sqlContext.read.format(...).load(train_df_path)
# Instantiating the map
some_map = {
'a': 0,
'b': 1,
'c': 1,
}
# Creating a new column using the map
df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)
This results in the following error:
AttributeErrorTraceback (most recent call last)
<ipython-input-12-aeee412b10bf> in <module>()
25 df= train_df
26
---> 27 df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)
/usr/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
962 if name not in self.columns:
963 raise AttributeError(
--> 964 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
965 jc = self._jdf.apply(name)
966 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'apply'
Other potentially useful info: I am using Spark and Python 2.
Recommended answer
The syntax you are using is for a pandas DataFrame. To achieve this for a Spark DataFrame, you should use the withColumn() method. This works great for a wide range of well-defined DataFrame functions, but it's a little more complicated for user-defined mapping functions.
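For instance, with a built-in column function the call is a one-liner. A minimal sketch, assuming you simply wanted an upper-cased copy of the column (upper() and the output column name are chosen purely for illustration):
import pyspark.sql.functions as f

# Derive a new column from an existing one with a built-in function
df = df.withColumn('upper_column', f.upper(f.col('some_column_name')))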
In order to define a udf, you need to specify the output data type. For instance, if you wanted to apply a function my_func that returned a string, you could create a udf as follows:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

my_udf = f.udf(my_func, StringType())
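Here my_func is a stand-in for whatever function you want to apply. A hypothetical placeholder, just to make the snippet self-contained, could be as simple as:
# Hypothetical placeholder for my_func: any plain Python function
# returning a string works here; the name and body are illustrative
def my_func(value):
    return str(value).strip().lower()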
Then you can use my_udf to create a new column like:
df = df.withColumn('new_column', my_udf(f.col("some_column_name")))
Another option is to use select():
df = df.select("*", my_udf(f.col("some_column_name")).alias("new_column"))
Specific Problem

Using a udf
In your specific case, you want to use a dictionary to translate the values of your DataFrame.
Here is a way to define a udf for this purpose:
from pyspark.sql.types import IntegerType
some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())
Notice that I used dict.get() because you want your udf to be robust to bad inputs.
df = df.withColumn('new_column', some_map_udf(f.col("some_column_name")))
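Putting the pieces together, a minimal end-to-end sketch might look like the following. It reuses the sqlContext from the question, but the sample rows and the column name some_column_name are made up for illustration:
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

some_map = {'a': 0, 'b': 1, 'c': 1}

# Toy DataFrame standing in for the real one loaded from disk
df = sqlContext.createDataFrame([('a',), ('b',), ('z',)], ['some_column_name'])

some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())
df = df.withColumn('new_column', some_map_udf(f.col('some_column_name')))
# 'a' -> 0, 'b' -> 1, and an unmapped value like 'z' -> null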
Using DataFrame functions
Sometimes using a udf is unavoidable, but whenever possible, using DataFrame functions is usually preferred.
Here is one option to do the same thing without using a udf. The trick is to iterate over the items in some_map to create a list of pyspark.sql.functions.when() expressions.
some_map_func = [f.when(f.col("some_column_name") == k, v) for k, v in some_map.items()]
print(some_map_func)
#[Column<CASE WHEN (some_column_name = a) THEN 0 END>,
# Column<CASE WHEN (some_column_name = c) THEN 1 END>,
# Column<CASE WHEN (some_column_name = b) THEN 1 END>]
Now you can use pyspark.sql.functions.coalesce()
inside of a select:
df = df.select("*", f.coalesce(*some_map_func).alias("some_column_name"))
This works because when() returns null by default if the condition is not met, and coalesce() will pick the first non-null value it encounters. Since the keys of the map are unique, at most one column will be non-null.