Using monotonically_increasing_id() for assigning row number to pyspark dataframe

Question

I am using monotonically_increasing_id() to assign a row number to a pyspark dataframe with the syntax below:

from pyspark.sql.functions import monotonically_increasing_id
df1 = df1.withColumn("idx", monotonically_increasing_id())

Now df1 has 26,572,528 records, so I was expecting idx values from 0 to 26,572,527.

But when I select max(idx), its value is strangely huge: 335,008,054,165.

What's going on with this function? Is it reliable to use this function for merging with another dataset that has a similar number of records?

I have some 300 dataframes which I want to combine into a single dataframe. So one dataframe contains IDs and the others contain different records corresponding to them row-wise.

Answer

Edit: Full examples of the ways to do this and the risks can be found here.

From the docs:

A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
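
As a quick sanity check (my own arithmetic, not part of the original answer), the strange max(idx) from the question matches this bit layout exactly:

# split the value from the question according to the documented layout:
# partition id in the upper bits, row number within the partition in the lower 33 bits
max_idx = 335008054165
partition_id = max_idx >> 33                   # -> 39
row_in_partition = max_idx & ((1 << 33) - 1)   # -> 605077

So the huge value simply comes from a row sitting in partition 39; it says nothing about the total row count, which is why it looks nothing like 26,572,527.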

Thus, it is not like an auto-increment id in RDBs and it is not reliable for merging.

If you need auto-increment behavior like in RDBs and your data is sortable, then you can use row_number:

df.createOrReplaceTempView('df')
# order by the column you want the numbering to follow
spark.sql('select row_number() over (order by some_column) as num, * from df')
+---+-----------+
|num|some_column|
+---+-----------+
|  1|   ....... |
|  2|   ....... |
|  3| ..........|
+---+-----------+
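
The same row_number approach can also be written with the DataFrame API instead of SQL; a minimal sketch, assuming some_column is the column you want to order by:

from pyspark.sql import Window
from pyspark.sql import functions as F

# note: a window with no partitionBy pulls all rows into a single partition,
# so this can be slow or memory-heavy on large data
w = Window.orderBy("some_column")
df = df.withColumn("num", F.row_number().over(w))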

If your data is not sortable and you don't mind using RDDs to create the indexes and then falling back to dataframes, you can use rdd.zipWithIndex().

An example can be found here.

In short:

# since you have a dataframe, use the rdd interface to create indexes with zipWithIndex()
df = df.rdd.zipWithIndex()
# return back to dataframe
df = df.toDF()

df.show()

# your data           | indexes
+---------------------+---+
|         _1          | _2| 
+---------------------+---+
|[data col1,data col2]|  0|
|[data col1,data col2]|  1|
|[data col1,data col2]|  2|
+---------------------+---+

You will probably need some more transformations after that to get your dataframe to what you need it to be. Note: not a very performant solution.
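
For instance, starting again from the original dataframe, one way to end up with named columns instead of _1/_2 could look like this (just a sketch; "data" and "idx" are placeholder names I chose, and your real column names come back out of the struct):

# zipWithIndex yields (original Row, index) pairs; name the two fields,
# then expand the struct back into the original columns
indexed = df.rdd.zipWithIndex().toDF(["data", "idx"])
indexed = indexed.select("data.*", "idx")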

Hope this helps. Good luck!

Come to think of it, you can combine monotonically_increasing_id with row_number:

# create a monotonically increasing id
df = df.withColumn("idx", monotonically_increasing_id())

# the id is increasing but not consecutive, which means you can sort by it,
# so row_number over it gives consecutive numbers
df.createOrReplaceTempView('df')
new_df = spark.sql('select row_number() over (order by idx) as num, * from df')

Not sure about the performance, though.
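
And for the merge of ~300 dataframes from the question: once each dataframe carries such a consecutive num column, they can be joined on it. A rough sketch, assuming dfs is your list of already-numbered dataframes:

from functools import reduce

# chain inner joins on the shared row-number column
combined = reduce(lambda left, right: left.join(right, on="num"), dfs)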
