Spark - Monotonically increasing id not working as expected in dataframe?


Question

I have a dataframe df in Spark which looks something like this:

scala> df.show()
+--------+--------+
|columna1|columna2|
+--------+--------+
|     0.1|     0.4|
|     0.2|     0.5|
|     0.1|     0.3|
|     0.3|     0.6|
|     0.2|     0.7|
|     0.2|     0.8|
|     0.1|     0.7|
|     0.5|     0.5|
|     0.6|    0.98|
|     1.2|     1.1|
|     1.2|     1.2|
|     0.4|     0.7|
+--------+--------+

I tried to add an id column with the following code:

// `monotonicallyIncreasingId` is deprecated; `monotonically_increasing_id`
// is the current name in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.monotonically_increasing_id

val df_id = df.withColumn("id", monotonically_increasing_id())

but the id column is not what I expect:

scala> df_id.show()
+--------+--------+----------+
|columna1|columna2|        id|
+--------+--------+----------+
|     0.1|     0.4|         0|
|     0.2|     0.5|         1|
|     0.1|     0.3|         2|
|     0.3|     0.6|         3|
|     0.2|     0.7|         4|
|     0.2|     0.8|         5|
|     0.1|     0.7|8589934592|
|     0.5|     0.5|8589934593|
|     0.6|    0.98|8589934594|
|     1.2|     1.1|8589934595|
|     1.2|     1.2|8589934596|
|     0.4|     0.7|8589934597|
+--------+--------+----------+

As you can see, it goes well from 0 to 5 but then the next id is 8589934592 instead of 6 and so on.

So what is wrong here? Why is the id column not properly indexed here?

Answer

It works as expected. This function is not intended for generating consecutive values. Instead, it encodes the partition number and the record index within each partition. As the Spark documentation explains:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
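The jump seen in the question's output follows directly from that bit layout. A minimal pure-Scala sketch of the encoding (`makeId` is a hypothetical helper for illustration, not Spark's actual implementation):

```scala
// Mimic the documented layout: upper 31 bits = partition index,
// lower 33 bits = record number within that partition.
// (`makeId` is an illustrative helper, not part of Spark's API.)
def makeId(partition: Long, record: Long): Long = (partition << 33) | record

// Two partitions with 3 records each, as in the docs' example:
val ids = for (p <- 0L to 1L; r <- 0L to 2L) yield makeId(p, r)
println(ids.mkString(", "))
// → 0, 1, 2, 8589934592, 8589934593, 8589934594
```

The first id of the second partition is `1L << 33 = 8589934592`, which is exactly where the question's output jumps: the `df` there evidently had two partitions, with the first six rows in partition 0.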

If you want consecutive numbers, use RDD.zipWithIndex.
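A hedged sketch of that approach, assuming an existing SparkSession named `spark` and the `df` from the question (note that `zipWithIndex` triggers an extra Spark job, so it is costlier than `monotonically_increasing_id`):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip every row with a consecutive 0-based index, then append it
// as a new column by rebuilding the DataFrame with an extended schema.
val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val schema = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
val df_id = spark.createDataFrame(indexed, schema)
```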
