二进制分区n个排序的数据指向`O(b.log(n))`中的b个桶 [英] Binary Partitioning n sorted data points in to b Buckets in `O(b.log(n))`
问题描述
给出大小为n的经过排序的可随机访问的输入数据(排序的数组),我想使用参数化的分区函数将其划分为存储桶.我希望将结果作为指向铲斗边缘的索引数组返回.
Given sorted randomly accessible input data (a sorted array) of size n, I'd like to partition this in to buckets using a parameterised partitioning function. I want the result back as an array of indexes to the bucket edges.
分区功能返回一个布尔值,以指示值是否也应位于同一分区中.
The partitioning function returns a bool to indicate if too values should be in the same partition or not.
请注意,在进行分区之前,我们不知道需要多少个存储桶:
Note that prior to partitioning, we don't know how many buckets will be required:
- 可能所有输入都应放在单个存储桶中.
- 可能是每个输入都需要自己的存储桶的情况.
- 也许第一个值需要一个存储桶,而数组的其余部分在第二个存储桶中.
...这在运行分区算法之前是未知的.
… this is not know before running the partitioning algorithm.
作为一个具体示例,可以说我们具有分区功能:
As a concrete example, lets say we have the partitioning function:
sameBucket(a, b) = (a/10 == b/10)
其中/
是整数除法(四舍五入).所以
Where /
is integer division (it rounds down). So
sameBucket(0,1) == yes
sameBucket(1,2) == yes
sameBucket(0,9) == yes
sameBucket(0,10) == no
分区功能告诉我们 0
和 10
不应放置在同一存储桶中.
The partition function tells us 0
and 10
should not be placed in to the same bucket.
为清楚起见,请考虑使用下面显示其索引的输入数组(我假设有一个名为"end the end"的索引,称为end):
Consider this input array shown with its indexes underneath for clarity (I assume a "past the end" index called end):
[1, 3, 7, 14, 90, 91, 92, 93, 95, 99]
0 1 2 3 4 5 6 7 8 9 end
对于此数据,用 ^
表示属于新存储桶的元素:
For this data, the elements that are members of a new bucket are indicated with a ^
:
[1, 3, 7, 14, 90, 91, 92, 93, 95, 99]
0 1 2 3 4 5 6 7 8 9 end
^ ^ ^ ^
如果我使用上面的分区功能,我将只获得开始一个新存储桶的索引:
If I use the partition function above, I'd get back just the indexes that begin a new bucket:
[0, 3, 4, end]
结果数组中的每个索引代表存储桶函数表示的第一个元素与其之前的元素不在同一个存储桶中.
Each index in the result array represents the first element that the bucket function says is not in the same bucket as the element before it.
- 整个数据由范围
0…< end
表示. -
0…< 3
的范围是数字[1、3、7]
,除以它们均是0
10
. - 范围
3…< 4
是单个数字14
,当除以10
1 >. - 范围
4…< end
是数字[90、91、92、93、95、99]
,它们都是0 代码>除以
10
.
- The whole data is represented by the range
0…<end
. - The range
0…<3
are the numbers[1, 3, 7]
, which are all0
, when divided by10
. - The range
3…<4
is the single number14
, which is1
when divided by10
. - The range
4…<end
are the numbers[90, 91, 92, 93, 95, 99]
, which are all0
when divided by10
.
我相信,经过修改的二进制搜索应该能够有效地执行此分区.对于 n
输入值和 b
输出存储桶,运行时间应为最差的 O(b.log(n))
.有没有人有一种算法,甚至只是名称,我都可以查找吗?
I believe a modified binary search should be able to perform this partitioning efficiently. For n
input values, and b
output buckets, the run time should be at worst O(b.log(n))
. Does any one have an algorithm for it, even just the name, so I can look it up?
推荐答案
要有效解决此问题,需要假设一个假设,即根据 sameBucket(左,右),已排序范围两端的元素属于同一存储桶
,那么 left
和 right
之间的所有值也必须属于同一存储桶.
Solving this problem efficiently requires an assumption that if elements at two ends of a sorted range belong to the same bucket according to sameBucket(left, right)
, then all values between left
and right
must belong to the same bucket as well.
我相信,经过修改的二进制搜索应该能够有效地执行此分区
I believe a modified binary search should be able to perform this partitioning efficiently
是的,您可以运行二进制搜索,如下所示:
Yes, you can run a binary search, like this:
- 将
nextBucket
设置为零 - 将
left
设置为nextBucket
,并将right
设置为输入数组的末尾 - 将
mid
设置为left
和right
之间的中点 - 如果
sameBucket(nextBucket,mid)
是true
,则将left
移到mid
;否则,将right
移至mid
- 如果
left == right
,则退出循环;否则,请返回步骤3 - 循环完成后,
left
是下一个分区索引.nextBucket
和left
之间的所有项目都在同一个存储桶中. - 将
nextBucket
设置为left + 1
- 如果
nextBucket
等于n
,那么您完成了;否则,请返回步骤2.
- set
nextBucket
to zero - set
left
tonextBucket
andright
to the end of the input array - set
mid
to midpoint betweenleft
andright
- if
sameBucket(nextBucket, mid)
istrue
, moveleft
tomid
; otherwise, moveright
tomid
- if
left == right
, exit the loop; otherwise, go back to step 3 - Once the loop is done,
left
is the next partition index. All items betweennextBucket
andleft
are in the same bucket. - Set
nextBucket
toleft+1
- If
nextBucket
is equal ton
, you are done; otherwise, go back to step 2.
我不认为此算法有一个特殊的名称-这是伪装得不好的二进制搜索.
I don't think this algorithm has a special name - it is a poorly disguised binary search.
这篇关于二进制分区n个排序的数据指向`O(b.log(n))`中的b个桶的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!