Transforming a column and updating the DataFrame
Problem description
So, what I'm doing below is dropping a column A from a DataFrame, because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I just join the two resulting data frames.
df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd
        .map(lambda x: (x.ID, json.loads(x.A))
                       if x.A is not None else (x.ID, None))
        .toDF()
        .withColumnRenamed('_1', 'ID')
        .withColumnRenamed('_2', 'A'),
    ['ID']
)
The thing I dislike about this is, of course, the overhead I incur because I have to do the withColumnRenamed operations.
With pandas, all I'd do is something like this:
import json
import numpy as np
import pandas as pd

pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So, is there an easier way than what I'm doing in my first code snippet?
Recommended answer
I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd\
    .map(
        lambda row: tuple(
            [row[c] if c != 'A'
             else (json.loads(row[c]) if row[c] is not None else None)
             for c in cols]
        )
    )\
    .toDF(cols)
* I haven't actually tested this code, but I think it should work.
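To see what that mapping does to a single row, here is a minimal plain-Python sketch; the dict-style row and the sample values are stand-ins for a real Row and df_data.columns, not anything from the original post:

```python
import json

# Stand-ins for df_data.columns and one Row (illustrative values only)
cols = ['ID', 'A']
row = {'ID': 1, 'A': '[0, 0, 0]'}

# Same per-row logic as the map() above:
# parse column 'A' with json.loads, pass every other column through unchanged
transformed = tuple(
    row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
    for c in cols
)
# transformed == (1, [0, 0, 0])

# A null 'A' stays null rather than crashing json.loads
row_null = {'ID': 2, 'A': None}
transformed_null = tuple(
    row_null[c] if c != 'A' else (json.loads(row_null[c]) if row_null[c] is not None else None)
    for c in cols
)
# transformed_null == (2, None)
```

Spark applies exactly this function to every row of the RDD, so the column order in the output tuples must match cols, which is why toDF(cols) can reattach the original names.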
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
where my_transformation_function() can be a udf or a pyspark sql function.