Pyspark add sequential and deterministic index to dataframe
Question
I need to add an index column to a dataframe with three very simple constraints:
- start from 0
- be sequential
- be deterministic
I'm sure I'm missing something obvious, because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic, increasingly monotonic ids. I don't want to zip with index and then have to separate the previously separated columns that are now in a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple sequence of integers from 0 to df.count. What am I missing here?
Answer
What I mean is: how can I add a column with an ordered, monotonically increasing-by-1 sequence from 0 to df.count? (from the comments)
You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id():
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

# Number the rows in the order given by monotonically_increasing_id(),
# then shift by 1 so the index starts at 0
df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)
Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count - 1.
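As a quick sanity check, here is a minimal end-to-end sketch on a toy dataframe (the data and the value column are made up for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# Toy dataframe standing in for the real, much larger one
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)

df.show()
# +-----+-----+
# |value|index|
# +-----+-----+
# |    a|    0|
# |    b|    1|
# |    c|    2|
# +-----+-----+
# The index runs 0 .. df.count() - 1; row order follows the
# original partition layout of the dataframe.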
"I don't want to zip with index and then have to separate the previously separated columns that are now in a single column"
You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:
cols = df.columns

df = (
    df.rdd
    .zipWithIndex()                              # pair each Row with its index: (Row, i)
    .map(lambda row: (row[1],) + tuple(row[0]))  # flatten to (i, col1, col2, ...)
    .toDF(["index"] + cols)
)
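A minimal usage sketch, again on a made-up toy dataframe (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 10), ("b", 20)], ["letter", "number"])
cols = df.columns

df = (
    df.rdd
    .zipWithIndex()
    .map(lambda row: (row[1],) + tuple(row[0]))
    .toDF(["index"] + cols)
)

df.show()
# +-----+------+------+
# |index|letter|number|
# +-----+------+------+
# |    0|     a|    10|
# |    1|     b|    20|
# +-----+------+------+

zipWithIndex assigns consecutive indices from 0 across partitions, so it satisfies all three constraints without shuffling the data. By contrast, the Window approach above has no partitionBy, so Spark moves all rows into a single partition to compute row_number, which can matter at terabyte scale.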