Transforming a column and updating the DataFrame
Problem description
So, what I'm doing below is dropping a column A from a DataFrame, because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I just join the two resulting data frames.
df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd
        .map(lambda x: (x.ID, json.loads(x.A))
                       if x.A is not None else (x.ID, None))
        .toDF()
        .withColumnRenamed('_1', 'ID')
        .withColumnRenamed('_2', 'A'),
    ['ID']
)
The thing I dislike about this is, of course, the overhead I incur because I have to do the withColumnRenamed operations.
With pandas, all I'd do is something like this:
import json
import numpy as np
import pandas as pd

pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So, is there an easier way than what I'm doing in my first code snippet?
Recommended answer
I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd\
    .map(
        lambda row: tuple(
            [row[c] if c != 'A'
             else (json.loads(row[c]) if row[c] is not None else None)
             for c in cols]
        )
    )\
    .toDF(cols)
* I haven't actually tested this code, but I think it should work.
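To see what that mapping does to a single row, here is a minimal plain-Python sketch; the dict-style row and the sample values are stand-ins for a real Row and df_data.columns, not anything from the original post:

```python
import json

# Stand-ins for df_data.columns and one Row (illustrative values only)
cols = ['ID', 'A']
row = {'ID': 1, 'A': '[0, 0, 0]'}

# Same per-row logic as the map() above:
# parse column 'A' with json.loads, pass every other column through unchanged
transformed = tuple(
    row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
    for c in cols
)
# transformed == (1, [0, 0, 0])

# A null 'A' stays null rather than crashing json.loads
row_null = {'ID': 2, 'A': None}
transformed_null = tuple(
    row_null[c] if c != 'A' else (json.loads(row_null[c]) if row_null[c] is not None else None)
    for c in cols
)
# transformed_null == (2, None)
```

Spark applies exactly this function to every row of the RDD, so the column order in the output tuples must match cols, which is why toDF(cols) can reattach the original names.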
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
where my_transformation_function() can be a udf or a pyspark sql function.