Using monotonically_increasing_id() for assigning row number to pyspark dataframe


Problem description

I am using monotonically_increasing_id() to assign row numbers to a pyspark dataframe with the syntax below:

from pyspark.sql.functions import monotonically_increasing_id

df1 = df1.withColumn("idx", monotonically_increasing_id())

Now df1 has 26,572,528 records, so I was expecting idx values from 0 to 26,572,527.

But when I select max(idx), its value is strangely huge: 335,008,054,165.
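
For reference, a minimal sketch (my wording, not the asker's) of how such a check might look, assuming the column is named idx as above:

# inspect the largest generated id
from pyspark.sql import functions as F
df1.agg(F.max("idx")).show()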

What's going on with this function? Is it reliable to use it for merging with another dataset that has a similar number of records?

I have some 300 dataframes that I want to combine into a single dataframe. One dataframe contains the IDs and the others contain different records corresponding to them row-wise.

Recommended answer

From the documentation:

A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

Thus, it is not like an auto-increment id in RDBs, and it is not reliable for merging.
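
This bit layout also explains the huge max(idx) seen in the question. A minimal decoding sketch (my illustration, using the 31/33-bit split quoted above):

# decode the observed max(idx) under the documented layout:
# upper bits = partition id, lower 33 bits = record number within that partition
observed = 335008054165
partition_id = observed >> 33                        # 39
record_in_partition = observed & ((1 << 33) - 1)     # 605077
print(partition_id, record_in_partition)

So the value is large because the partition id sits in the high bits, not because hundreds of billions of rows were numbered.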

If you need auto-increment behavior like in RDBs and your data is sortable, then you can use row_number:

# register the dataframe as a temp view and number the rows, ordered by some_column
df.createOrReplaceTempView('df')
spark.sql('select row_number() over (order by some_column) as num, * from df')
+---+-----------+
|num|some_column|
+---+-----------+
|  1|   ....... |
|  2|   ....... |
|  3| ..........|
+---+-----------+
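
The same thing can be sketched with the DataFrame API instead of SQL (my variant, assuming a sortable column named some_column):

from pyspark.sql import Window
from pyspark.sql.functions import row_number

df = df.withColumn("num", row_number().over(Window.orderBy("some_column")))

Note that a window ordered over the whole dataframe with no partitioning pulls all rows through a single partition, so it can be expensive on large data.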

If your data is not sortable and you don't mind using RDDs to create the indexes and then falling back to dataframes, you can use rdd.zipWithIndex().

An example can be found here.

In short:

# since you have a dataframe, use the rdd interface to create indexes with zipWithIndex()
df = df.rdd.zipWithIndex()
# return back to dataframe
df = df.toDF()

df.show()

# your data           | indexes
+---------------------+---+
|         _1          | _2| 
+---------------------+---+
|[data col1,data col2]|  0|
|[data col1,data col2]|  1|
|[data col1,data col2]|  2|
+---------------------+---+

You will probably need some more transformations after that to get the dataframe into the shape you need. Note: this is not a very performant solution.
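
As one possible shape for those transformations, a minimal sketch (my illustration, starting again from the original dataframe df) that flattens each (Row, index) pair into the original columns plus an index column named idx:

# flatten (Row, index) pairs into plain tuples, then rebuild a dataframe
# with the original column names plus an extra "idx" column
indexed = df.rdd.zipWithIndex() \
    .map(lambda pair: tuple(pair[0]) + (pair[1],)) \
    .toDF(df.columns + ["idx"])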

Hope this helps. Good luck!

EDIT: Come to think of it, you can combine monotonically_increasing_id with row_number:

# create a monotonically increasing id 
df = df.withColumn("idx", monotonically_increasing_id())

# then since the id is increasing but not consecutive, it means you can sort by it, so you can use the `row_number`
df.createOrReplaceTempView('df')
new_df = spark.sql('select row_number() over (order by idx) as num, * from df')

Not sure about the performance, though.
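
The same combination can also be sketched with the DataFrame API (my variant of the snippet above):

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# generate a sortable (but non-consecutive) id, then turn it into consecutive row numbers
new_df = df.withColumn("idx", monotonically_increasing_id()) \
           .withColumn("num", row_number().over(Window.orderBy("idx")))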
