Spark - Monotonically increasing id not working as expected in dataframe?


Question

I have a dataframe df in Spark which looks something like this:

scala> df.show()
+--------+--------+
|columna1|columna2|
+--------+--------+
|     0.1|     0.4|
|     0.2|     0.5|
|     0.1|     0.3|
|     0.3|     0.6|
|     0.2|     0.7|
|     0.2|     0.8|
|     0.1|     0.7|
|     0.5|     0.5|
|     0.6|    0.98|
|     1.2|     1.1|
|     1.2|     1.2|
|     0.4|     0.7|
+--------+--------+

I tried to include an id column with the following code

import org.apache.spark.sql.functions.monotonicallyIncreasingId

val df_id = df.withColumn("id", monotonicallyIncreasingId)

but the id column is not what I expect:

scala> df_id.show()
+--------+--------+----------+
|columna1|columna2|        id|
+--------+--------+----------+
|     0.1|     0.4|         0|
|     0.2|     0.5|         1|
|     0.1|     0.3|         2|
|     0.3|     0.6|         3|
|     0.2|     0.7|         4|
|     0.2|     0.8|         5|
|     0.1|     0.7|8589934592|
|     0.5|     0.5|8589934593|
|     0.6|    0.98|8589934594|
|     1.2|     1.1|8589934595|
|     1.2|     1.2|8589934596|
|     0.4|     0.7|8589934597|
+--------+--------+----------+

As you can see, it goes well from 0 to 5 but then the next id is 8589934592 instead of 6 and so on.

So what is wrong here? Why is the id column not properly indexed here?

Answer

It works as expected. This function is not intended for generating consecutive values. Instead, it encodes the partition number and the record's index within that partition. From the function's documentation:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
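You can verify this layout by decoding the ids yourself. A minimal sketch in plain Scala (no Spark needed; `decodeId` is a hypothetical helper, not a Spark API):

```scala
// Split a monotonically-increasing id into its two components:
// the partition ID lives in the upper 31 bits, the record number
// within the partition in the lower 33 bits.
def decodeId(id: Long): (Long, Long) = {
  val partitionId  = id >> 33                 // upper 31 bits
  val recordNumber = id & ((1L << 33) - 1)    // lower 33 bits
  (partitionId, recordNumber)
}

decodeId(5L)           // => (0, 5)  -- partition 0, record 5
decodeId(8589934592L)  // => (1, 0)  -- partition 1, record 0
```

So the "jump" in the question is simply the point where the data crosses from partition 0 into partition 1: 8589934592 is 1L << 33, i.e. partition 1, record 0.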

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.

If you want consecutive numbers, use RDD.zipWithIndex.
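A sketch of how that could look for the DataFrame in the question (assumes the column names shown above and a `SparkSession` named `spark`; note that `zipWithIndex` triggers an extra Spark job to count records per partition):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// zipWithIndex assigns consecutive Long indices across all partitions.
val rddWithId = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}

// Extend the original schema with the new id column.
val schema = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))

val df_id = spark.createDataFrame(rddWithId, schema)
```

Here `df_id` has ids 0, 1, 2, ... 11 with no gaps, at the cost of round-tripping through the RDD API.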
