Pyspark add sequential and deterministic index to dataframe


Question

I need to add an index column to a dataframe with three very simple constraints:

  • start from 0

  • be sequential

  • be deterministic

I'm sure I'm missing something obvious because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic, increasingly monotonic IDs. I don't want to zip with index and then have to separate the previously separated columns that are now in a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple 0 to df.count sequence of integers. What am I missing here?


Answer

What I mean is: how do I add a column with an ordered, monotonically increasing-by-1 sequence 0:df.count? (from the comments)

You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().

from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

# row_number() requires an ordering, so order the window by
# monotonically_increasing_id(); subtract 1 so the index starts at 0
df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)

Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count - 1.
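For illustration, here is a minimal sketch of what the snippet above produces on a small DataFrame (the sample data and the "value" column name are made up; the display order of show() is not guaranteed, but the index values always span 0 to count - 1):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# hypothetical three-row DataFrame
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

df_indexed = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)

df_indexed.show()
# +-----+-----+
# |value|index|
# +-----+-----+
# |    a|    0|
# |    b|    1|
# |    c|    2|
# +-----+-----+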

I don't want to zip with index and then have to separate the previously separated columns that are now in a single column

You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:

cols = df.columns
# zipWithIndex yields (Row, index) pairs; flatten them back into columns with the index first
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
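Whichever approach you use, a quick sanity check of the index column might look like this (a sketch, assuming the indexed DataFrame is named df as above):

# the index should be exactly the integers 0 .. df.count() - 1
n = df.count()
assert df.agg({"index": "min"}).first()[0] == 0
assert df.agg({"index": "max"}).first()[0] == n - 1
assert df.select("index").distinct().count() == n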
