Pyspark add sequential and deterministic index to dataframe


Question

I need to add an index column to a dataframe with three very simple constraints:

  • start from 0

  • be sequential

  • be deterministic

I'm sure I'm missing something obvious because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic, increasingly monotonic IDs. I don't want to zip with index and then have to separate the previously separated columns that are now in a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple 0 to df.count sequence of integers. What am I missing here?


Answer

What I mean is: how do I add a column with an ordered, monotonically increasing-by-1 sequence 0:df.count? (from the comments)

You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().

from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

# row_number() requires an ordering, so order the window by
# monotonically_increasing_id(); subtract 1 so the index starts at 0
df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)

Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count - 1.
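For illustration, here is a minimal sketch of what the snippet above produces on a small DataFrame (the sample data and the "value" column name are made up; the display order of show() is not guaranteed, but the index values always span 0 to count - 1):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# hypothetical three-row DataFrame
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

df_indexed = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)

df_indexed.show()
# +-----+-----+
# |value|index|
# +-----+-----+
# |    a|    0|
# |    b|    1|
# |    c|    2|
# +-----+-----+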

I don't want to zip with index and then have to separate the previously separated columns that are now in a single column

You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:

cols = df.columns
# zipWithIndex yields (Row, index) pairs; flatten them back into columns with the index first
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
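Whichever approach you use, a quick sanity check of the index column might look like this (a sketch, assuming the indexed DataFrame is named df as above):

# the index should be exactly the integers 0 .. df.count() - 1
n = df.count()
assert df.agg({"index": "min"}).first()[0] == 0
assert df.agg({"index": "max"}).first()[0] == n - 1
assert df.select("index").distinct().count() == n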
