Skip lists: are they really performing as well as the Pugh paper claims?


Question







I'm trying to implement a skip list that performs as well as a BST while using minimal additional memory overhead. At the moment, even without considering any memory restrictions, the performance of my SkipList implementation is far from that of even a very naive balanced BST implementation (so to say, a handcrafted BST :) ).

As a reference I'm using the original paper from William Pugh, PUG89, and the implementation I found in Algorithms in C by Sedgewick (13.5). My code is a recursive implementation; here's the gist of the insert and find operations:

// dist2(en) draws from the uniform distribution used for the coin flips;
// p is the promotion probability (.5 here) and max_level the level cap.
sl_node* create_node()
{
    short lvl {1};
    while((dist2(en)<p)&&(lvl<max_level))
        ++lvl;
    return new sl_node(lvl);
}

void insert_impl(sl_node* cur_node,
        sl_node* new_node,
        short lvl)
{
    if(cur_node->next_node[lvl]==nullptr || cur_node->next_node[lvl]->value > new_node->value){
        if(lvl<new_node->lvl){
            new_node->next_node[lvl] = cur_node->next_node[lvl];
            cur_node->next_node[lvl] = new_node;
        }
        if(lvl==0) return;
        insert_impl(cur_node,new_node,lvl-1);
        return;
    }
    insert_impl(cur_node->next_node[lvl],new_node,lvl);
}
sl_node* insert(long p_val)
{
    sl_node* new_node = create_node();
    new_node->value = p_val;
    insert_impl(head, new_node,max_level-1);
    return new_node;
}

And this is the code for the find operation:

sl_node* find_impl(sl_node* cur_node,
        long p_val,
        int lvl)
{
    if(cur_node==nullptr) return nullptr;
    if(cur_node->value==p_val) return cur_node;
    if(cur_node->next_node[lvl] == nullptr || cur_node->next_node[lvl]->value>p_val){
        if(lvl==0) return nullptr;
        return find_impl(cur_node,p_val,lvl-1);
    }
    return find_impl(cur_node->next_node[lvl],p_val,lvl);
}

sl_node* find(long p_val)
{
    return find_impl(head,p_val,max_level-1);
}

An sl_node (a skip list node) looks like this:

struct sl_node
{
    long  value;
    short lvl;
    sl_node** next_node;

    sl_node(int l) : lvl(l)
    {
        next_node = new sl_node*[l];
        for(short i{0};i<l;i++)
            next_node[i]=nullptr;
    }
    ~sl_node()
    {
        delete[] next_node;
    }
};

As you can see, this implementation has nothing special or advanced compared to the book implementation. I will not share the balanced BST code since I don't think it's needed here; it is a very basic BST with a rebalance function triggered during insertion when the new node height is greater than 16*lg(n), where n is the number of nodes.

So to say, I rebalance the tree only if the maximum height is 16 times greater than the best theoretical one. As I said, this straightforward homemade BST performs much better than the homemade skip list.

But first, let's have a look at some data. Using p=.5 and n=262144, the levels of the nodes in the SkipList have the following distribution:

1:141439, 53.9547%
2:65153, 24.8539%
3:30119, 11.4895%
4:13703, 5.22728%
5:6363, 2.42729%
6:2895, 1.10435%
7:1374, 0.524139%
8:581, 0.221634%
9:283, 0.107956%
10:117, 0.044632%
11:64, 0.0244141%
12:31, 0.0118256%
13:11, 0.00419617%
14:5, 0.00190735%
15:1, 0.00038147%
16:5, 0.00190735%

Which almost perfectly (oh, this is a big one!) matches the theory from the article, that is: 50% level 1, 25% level 2, and so on and so forth. The input data came from my best available pseudo-random number generator, aka std::random_device with std::default_random_engine and a uniform int distribution. The input looks pretty random to me :)!

The time required to search for 'all' 262144 elements in the SkipList (in random order) is 315ms on my machine, whereas the same search operation on the naive BST takes 134ms, so the BST is almost twice as fast as the SkipList. That is not exactly what I was expecting from "Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space" (PUG89).

The time required for the 'insertion' of the nodes is 387ms for the SkipList and 143ms for the BST; again, the naive BST performs better.

Things get a little more interesting if, instead of a random sequence of input numbers, I use a sorted sequence. Here my poor homemade BST becomes slow: inserting 262144 sorted ints takes 2866ms, whereas the SkipList requires only 168ms.

But when it comes to search time, the BST is still faster! For the sorted input we have 234ms versus 77ms; the BST is 3x faster.

With different values of the p-factor I got slightly different performance results.

And last but not least, the memory usage graph, which as you may expect confirms that increasing the number of levels per node significantly impacts the memory footprint. The memory usage is calculated simply as the sum of the space required to store the additional pointers for all the nodes.

After all this, can any of you provide me with a comment on how to implement a SkipList that performs as well as a BST, not counting the additional memory overhead?

I know about the DrDobbs article LINK about skip lists, and I went through the whole article; the code for the search and insert operations exactly matches the original implementation from PUG89, so it should be as good as mine. The article does not provide any performance analysis anyway, and I have not found any other source. Can you help me?

Any comment is highly appreciated!

Thanks! :)

Solution

History

Times have changed a bit since William Pugh wrote his original paper. We see no mention in his paper about the memory hierarchy of the CPU and operating system which has become such a prevalent focus today (now often equally as important as algorithmic complexity).

His input case for benchmarking had a measly 2^16 elements, and hardware back then typically had, at most, 32-bit extended memory addressing available. This made the size of a pointer half the size or smaller than what we're used to today on 64-bit machines. Meanwhile a string field, e.g., could be just as large, making the ratio between the elements stored in the skip list and the pointers required by a skip node potentially a lot smaller, especially given that we often need a number of pointers per skip node.

C compilers weren't so aggressive at optimization back then with respect to things like register allocation and instruction selection. Even average hand-written assembly could often provide a significant benefit in performance. Compiler hints like register and inline actually made a big difference in those times. While this might seem kind of moot since both a balanced BST and a skip list implementation would be on equal footing here, optimization of even a basic loop was a more manual process. When optimization is an increasingly manual process, something that is easier to implement is often easier to optimize. Skip lists are often considered to be a lot easier to implement than a balancing tree.

So all of these factors probably had a part in Pugh's conclusions at the time. Yet times have changed: hardware has changed, operating systems have changed, compilers have changed, more research has been done into these topics, etc.

Implementation

With that aside, let's have some fun and implement a basic skip list. I ended up adapting the implementation available here out of laziness. It's a run-of-the-mill kind of implementation, hardly different from the plethora of easily-accessible exemplary skip list implementations out there today.

We'll be comparing the performance of our implementation against std::set which is almost always implemented as a red-black tree*.

* Some might wonder why I use 0 instead of nullptr and things of that sort. It's a habit. At my workplace, we still have to write open libraries that target a wide range of compilers including those that only support C++03, so I'm still used to writing lower/mid-level implementation code that way, and sometimes even in C, so please forgive the old style in which I wrote this code.

#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <vector>
#include <cassert>
#include <cstring>
#include <set>

using namespace std;

static const int max_level = 32;
static const float probability = 0.5;

static double sys_time()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

static int random_level() 
{
    int lvl = 1;
    while ((static_cast<float>(rand()) / RAND_MAX) < probability && lvl < max_level)
        ++lvl;
    return lvl;
}

template <class T>
class SkipSet
{
public:
    SkipSet(): head(0)
    {
        head = create_node(max_level, T());
        level = 0;
    }

    ~SkipSet()
    {
        while (head)
        {
            Node* to_destroy = head;
            head = head->next[0];
            destroy_node(to_destroy);
        }
    }

    bool contains(const T& value) const
    {
        const Node* node = head;
        for (int i=level; i >= 0; --i)
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
        }
        node = node->next[0];
        return node && node->value == value;
    }

    void insert(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1];
        memset(update, 0, sizeof(Node*)*(max_level + 1));

        for (int i = level; i >= 0; i--) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (!node || node->value != value)
        {
            int lvl = random_level();
            assert(lvl >= 0);
            if (lvl > level) 
            {
                for (int i = level + 1; i <= lvl; i++) {
                    update[i] = head;
                }
                level = lvl;
            }
            node = create_node(lvl, value);
            for (int i = 0; i <= lvl; i++) {
                node->next[i] = update[i]->next[i];
                update[i]->next[i] = node;
            }            
        }
    }

    bool erase(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1];
        memset(update, 0, sizeof(Node*)*(max_level + 1));

        for (int i = level; i >= 0; i--) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (node && node->value == value)
        {
            for (int i = 0; i <= level; i++) {
                if (update[i]->next[i] != node)
                    break;
                update[i]->next[i] = node->next[i];
            }
            destroy_node(node);
            while (level > 0 && !head->next[level])
                --level;
            return true;
        }
        return false;
    }

private:
    struct Node
    {
        T value;
        struct Node** next;
    };

    Node* create_node(int level, const T& new_value)
    {
        void* node_mem = malloc(sizeof(Node));
        Node* new_node = static_cast<Node*>(node_mem);
        new (&new_node->value) T(new_value);

        void* next_mem = calloc(level+1, sizeof(Node*));
        new_node->next = static_cast<Node**>(next_mem);
        return new_node;
    }

    void destroy_node(Node* node)
    {
        node->value.~T();
        free(node->next);
        free(node);
    }

    Node* head;
    int level;
};

template <class T>
bool contains(const std::set<T>& cont, const T& val)
{
    return cont.find(val) != cont.end();
}

template <class T>
bool contains(const SkipSet<T>& cont, const T& val)
{
    return cont.contains(val);
}

template <class Set, class T>
void benchmark(int num, const T* elements, const T* search_elements)
{
    const double start_insert = sys_time();
    Set element_set;
    for (int j=0; j < num; ++j)
        element_set.insert(elements[j]);
    cout << "-- Inserted " << num << " elements in " << (sys_time() - start_insert) << " secs" << endl;

    const double start_search = sys_time();
    int num_found = 0;
    for (int j=0; j < num; ++j)
    {
        if (contains(element_set, search_elements[j]))
            ++num_found;
    }
    cout << "-- Found " << num_found << " elements in " << (sys_time() - start_search) << " secs" << endl;

    const double start_erase = sys_time();
    int num_erased = 0;
    for (int j=0; j < num; ++j)
    {
        if (element_set.erase(search_elements[j]))
            ++num_erased;
    }
    cout << "-- Erased " << num_found << " elements in " << (sys_time() - start_erase) << " secs" << endl;
}

int main()
{
    const int num_elements = 200000;
    vector<int> elements(num_elements);
    for (int j=0; j < num_elements; ++j)
        elements[j] = j;
    random_shuffle(elements.begin(), elements.end());

    vector<int> search_elements = elements;
    random_shuffle(search_elements.begin(), search_elements.end());

    typedef std::set<int> Set1;
    typedef SkipSet<int> Set2;

    for (int j=0; j < 3; ++j)
    {
        cout << "std::set" << endl;
        benchmark<Set1>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;

        cout << "SkipSet" << endl;
        benchmark<Set2>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;
    }
}

On GCC 5.2, -O2, I get this:

std::set
-- Inserted 200000 elements in 0.104869 secs
-- Found 200000 elements in 0.078351 secs
-- Erased 200000 elements in 0.098208 secs

SkipSet
-- Inserted 200000 elements in 0.188765 secs
-- Found 200000 elements in 0.160895 secs
-- Erased 200000 elements in 0.162444 secs

... which is pretty awful. We're around twice as slow all across the board.

Optimization

Yet there is a glaring optimization we can make. If we look at Node, its current fields look like this:

struct Node
{
    T value;
    struct Node** next;
};

This implies that the memory for the Node fields and its list of next pointers are two separate blocks, possibly with a very distant stride between them like so:

    [Node fields]-------------------->[next0,next1,...,null]

This does poorly for locality of reference. If we want to improve things here, we should merge these memory blocks into a single contiguous structure, like so:

    [Node fields,next0,next1,...,null]

We can achieve this through the variable-length struct idiom common in C. It's a little bit awkward to implement in C++ which doesn't support it so directly, but we can emulate the effect like so:

struct Node
{
    T value;
    struct Node* next[1];
};

Node* create_node(int level, const T& new_value)
{
    void* node_mem = malloc(sizeof(Node) + level * sizeof(Node*));
    Node* new_node = static_cast<Node*>(node_mem);
    new (&new_node->value) T(new_value);
    for (int j=0; j < level+1; ++j)
        new_node->next[j] = 0;
    return new_node;
}

void destroy_node(Node* node)
{
    node->value.~T();
    free(node);
}

With this modest tweak, we now have these times:

SkipSet (Before)
-- Inserted 200000 elements in 0.188765 secs
-- Found 200000 elements in 0.160895 secs
-- Erased 200000 elements in 0.162444 secs

SkipSet (After)
-- Inserted 200000 elements in 0.132322 secs
-- Found 200000 elements in 0.127989 secs
-- Erased 200000 elements in 0.130889 secs

... which gets us considerably closer to rivaling the performance of std::set.

Random Number Generator

A truly efficient skip list implementation will generally want a very fast RNG. Nevertheless, I found during a quick profiling session that only a very teeny portion of time is spent generating a random level/height, hardly enough to consider it much of a hotspot. It would also only impact the insertion times unless it provided a more uniform distribution, so I've decided to skip this optimization.
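For readers who do want to pursue that optimization anyway, here is a hedged sketch of what a faster level generator could look like. It replaces the `rand()%2` loop with a single call to a xorshift32 PRNG (a well-known generator; the function names and seed here are my own, not from the original implementation) and then consumes one bit per coin flip, which preserves the same geometric distribution of levels:

```cpp
#include <cassert>

static const int max_level = 32;

// Hypothetical fast level generator. xorshift32 produces a whole
// word of pseudo-random bits per call; we then treat each low-order
// bit as an independent fair coin, matching the distribution of the
// original rand()%2 loop but with one PRNG call per insertion.
static unsigned xs_state = 0x9E3779B9u; // arbitrary nonzero seed

static unsigned xorshift32()
{
    unsigned x = xs_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return xs_state = x;
}

static int random_level_fast()
{
    // Count consecutive low-order 1-bits: each has probability 0.5,
    // so the resulting level is geometrically distributed in
    // [1, max_level], just like the original loop.
    unsigned r = xorshift32();
    int lvl = 1;
    while ((r & 1u) && lvl < max_level)
    {
        r >>= 1;
        ++lvl;
    }
    return lvl;
}
```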

Memory Allocator

At this point, I'd say we have a pretty reasonable overview of what we can expect with a skip list implementation vs. a BST:

Insertion
-- std::set: 0.104869 secs
-- SkipList: 0.132322 secs

Search:
-- std::set: 0.078351 secs
-- SkipList: 0.127989 secs

Removal:
-- std::set: 0.098208 secs
-- SkipList: 0.130889 secs

However, if we want to soldier on a little further, we can utilize a fixed allocator. At this point, we're cheating a little bit as std::set is designed to work with any general-purpose allocator which conforms to the interface requirements of a standard allocator. But let's have a look at using a fixed allocator:

#include <iostream>
#include <iomanip>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <vector>
#include <cassert>
#include <set>

using namespace std;

static const int max_level = 32;

class FixedAlloc
{
public:
    FixedAlloc(): root_block(0), free_element(0), type_size(0), block_size(0), block_num(0)
    {
    }

    FixedAlloc(int itype_size, int iblock_size): root_block(0), free_element(0), type_size(0), block_size(0), block_num(0)
    {
        init(itype_size, iblock_size);
    }

    ~FixedAlloc()
    {
        purge();
    }

    void init(int new_type_size, int new_block_size)
    {
        purge();
        type_size = max(new_type_size, static_cast<int>(sizeof(FreeElement)));
        block_size = max(new_block_size, type_size);
        block_num = block_size / type_size;
    }

    void purge()
    {
        while (root_block)
        {
            Block* block = root_block;
            root_block = root_block->next;
            free(block);
        }
        free_element = 0;
    }

    void* allocate()
    {
        assert(type_size > 0);
        if (free_element)
        {
            void* mem = free_element;
            free_element = free_element->next_element;
            return mem;
        }

        // Create new block.
        void* new_block_mem = malloc(sizeof(Block) - 1 + type_size * block_num);
        Block* new_block = static_cast<Block*>(new_block_mem);
        new_block->next = root_block;
        root_block = new_block;

        // Push all but one of the new block's elements to the free pool.
        char* mem = new_block->mem;
        for (int j=1; j < block_num; ++j)
        {
            FreeElement* element = reinterpret_cast<FreeElement*>(mem + j * type_size);
            element->next_element = free_element;
            free_element = element;
        }
        return mem;
    }

    void deallocate(void* mem)
    {
        FreeElement* element = static_cast<FreeElement*>(mem);
        element->next_element = free_element;
        free_element = element;
    }

    void swap(FixedAlloc& other)
    {
        std::swap(free_element, other.free_element);
        std::swap(root_block, other.root_block);
        std::swap(type_size, other.type_size);
        std::swap(block_size, other.block_size);
        std::swap(block_num, other.block_num);
    }

private:
    struct Block
    {
        Block* next;
        char mem[1];
    };
    struct FreeElement
    {
        struct FreeElement* next_element;
    };

    // Disable copying.
    FixedAlloc(const FixedAlloc&);
    FixedAlloc& operator=(const FixedAlloc&);

    struct Block* root_block;
    struct FreeElement* free_element;
    int type_size;
    int block_size;
    int block_num;
};

static double sys_time()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

static int random_level()
{
    int lvl = 1;
    while (rand()%2 == 0 && lvl < max_level)
        ++lvl;
    return lvl;
}

template <class T>
class SkipSet
{
public:
    SkipSet(): head(0)
    {
        for (int j=0; j < max_level; ++j)
            allocs[j].init(sizeof(Node) + (j+1)*sizeof(Node*), 4096);
        head = create_node(max_level, T());
        level = 0;
    }

    ~SkipSet()
    {
        while (head)
        {
            Node* to_destroy = head;
            head = head->next[0];
            destroy_node(to_destroy);
        }
    }

    bool contains(const T& value) const
    {
        const Node* node = head;
        for (int i=level; i >= 0; --i) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
        }
        node = node->next[0];
        return node && node->value == value;
    }

    void insert(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1] = {0};
        for (int i=level; i >= 0; --i) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (!node || node->value != value)
        {
            const int lvl = random_level();
            assert(lvl >= 0);
            if (lvl > level) 
            {
                for (int i = level + 1; i <= lvl; ++i)
                    update[i] = head;
                level = lvl;
            }
            node = create_node(lvl, value);
            for (int i = 0; i <= lvl; ++i) 
            {
                node->next[i] = update[i]->next[i];
                update[i]->next[i] = node;
            }            
        }
    }

    bool erase(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1] = {0};
        for (int i=level; i >= 0; --i) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (node && node->value == value)
        {
            for (int i=0; i <= level; ++i) {
                if (update[i]->next[i] != node)
                    break;
                update[i]->next[i] = node->next[i];
            }
            destroy_node(node);
            while (level > 0 && !head->next[level])
                --level;
            return true;
        }
        return false;
    }

    void swap(SkipSet<T>& other)
    {
        for (int j=0; j < max_level; ++j)
            allocs[j].swap(other.allocs[j]);
        std::swap(head, other.head);
        std::swap(level, other.level);
    }

private:
    struct Node
    {
        T value;
        int num; // this node's level; selects the matching allocator in destroy_node
        struct Node* next[1];
    };

    Node* create_node(int level, const T& new_value)
    {
        void* node_mem = allocs[level-1].allocate();
        Node* new_node = static_cast<Node*>(node_mem);
        new (&new_node->value) T(new_value);
        new_node->num = level;
        for (int j=0; j < level+1; ++j)
            new_node->next[j] = 0;
        return new_node;
    }

    void destroy_node(Node* node)
    {
        node->value.~T();
        allocs[node->num-1].deallocate(node);
    }

    FixedAlloc allocs[max_level];
    Node* head;
    int level;
};

template <class T>
bool contains(const std::set<T>& cont, const T& val)
{
    return cont.find(val) != cont.end();
}

template <class T>
bool contains(const SkipSet<T>& cont, const T& val)
{
    return cont.contains(val);
}

template <class Set, class T>
void benchmark(int num, const T* elements, const T* search_elements)
{
    const double start_insert = sys_time();
    Set element_set;
    for (int j=0; j < num; ++j)
        element_set.insert(elements[j]);
    cout << "-- Inserted " << num << " elements in " << (sys_time() - start_insert) << " secs" << endl;

    const double start_search = sys_time();
    int num_found = 0;
    for (int j=0; j < num; ++j)
    {
        if (contains(element_set, search_elements[j]))
            ++num_found;
    }
    cout << "-- Found " << num_found << " elements in " << (sys_time() - start_search) << " secs" << endl;

    const double start_erase = sys_time();
    int num_erased = 0;
    for (int j=0; j < num; ++j)
    {
        if (element_set.erase(search_elements[j]))
            ++num_erased;
    }
    cout << "-- Erased " << num_erased << " elements in " << (sys_time() - start_erase) << " secs" << endl;
}

int main()
{
    const int num_elements = 200000;
    vector<int> elements(num_elements);
    for (int j=0; j < num_elements; ++j)
        elements[j] = j;
    random_shuffle(elements.begin(), elements.end());

    vector<int> search_elements = elements;
    random_shuffle(search_elements.begin(), search_elements.end());

    typedef std::set<int> Set1;
    typedef SkipSet<int> Set2;

    cout << fixed << setprecision(3);
    for (int j=0; j < 2; ++j)
    {
        cout << "std::set" << endl;
        benchmark<Set1>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;

        cout << "SkipSet" << endl;
        benchmark<Set2>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;
    }
}

* I also made a minor tweak to random_level to make it simply assume a probability of 50% after discovering that this seems to work quite well.

By using a fixed allocator, we can quickly allocate and deallocate elements using a very simple constant-time strategy, and more importantly, allocate nodes in a way such that nodes with the same height (the most frequently accessed together) are allocated more contiguously with respect to each other (though not in an ideal sequential order). This improves the times to:
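The free-list recycling at the heart of this strategy can be sketched in miniature (the `MiniPool`/`FreeSlot` names are mine, not part of the implementation above): deallocated slots are threaded into a singly-linked free list and handed back on the next allocation, so both operations are O(1) pointer manipulations once the pool is warm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Miniature free-list pool: a recycled slot's memory is reused to
// store the link to the next free slot, so the free list costs no
// extra space and allocate/deallocate are constant time.
struct FreeSlot { FreeSlot* next; };

struct MiniPool
{
    FreeSlot* free_head;
    MiniPool(): free_head(0) {}

    void* allocate(std::size_t size)
    {
        if (free_head)
        {
            void* mem = free_head;
            free_head = free_head->next;   // pop a recycled slot
            return mem;
        }
        // Cold path: fall back to the heap (a real pool would carve
        // out a whole block here, as FixedAlloc does).
        return std::malloc(size < sizeof(FreeSlot) ? sizeof(FreeSlot) : size);
    }

    void deallocate(void* mem)
    {
        FreeSlot* slot = static_cast<FreeSlot*>(mem);
        slot->next = free_head;            // push onto the free list
        free_head = slot;
    }
};
```

A quick way to see the recycling in action: allocate a slot, free it, and allocate again; the second allocation returns the exact same address.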

Insertion
-- std::set: 0.104869 secs
-- SkipList: 0.103632 secs

Search:
-- std::set: 0.078351 secs
-- SkipList: 0.089910 secs

Removal:
-- std::set: 0.098208 secs
-- SkipList: 0.089224 secs

... which isn't bad for about 40 minutes of work against std::set which has been poked and prodded and tuned by experts all over the industry. We also have faster removals than std::set (insertion times fluctuate a bit on my machine, I'd say they are roughly on par).

Of course we cheated to apply this last step. Using a custom allocator tilts things in our favor, and also changes the characteristics of the container in a way such that its memory cannot be freed until it is cleared, destroyed, or compacted. It can mark the memory as unused and reclaim it upon subsequent insertions, but it cannot make the memory it uses available for those outside the skip list to use. I also didn't bother to pay attention to proper alignment for all possible types of T which would be necessary to truly generalize it.
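One way to address that alignment caveat (a sketch only; `aligned_size` is my own helper, and it uses C++11's `alignof` for brevity, which the C++03-style code above avoids) is to round every element size up to a multiple of the strictest fundamental alignment before handing it to the pool, so every slot the allocator carves out is suitably aligned for any `T`:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round a requested element size up to a
// multiple of alignof(std::max_align_t), so consecutive slots of
// that size in a pool block all start on a suitably aligned address.
static std::size_t aligned_size(std::size_t requested)
{
    const std::size_t align = alignof(std::max_align_t);
    return (requested + align - 1) / align * align;
}
```

Over-aligned types (e.g. SIMD vectors) would still need special handling, but this covers every fundamental and ordinary struct type.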

Sorted Input

Let's try using this against sorted input. To do so, simply comment out the two random_shuffle statements. On my end, I now get these times:

std::set
-- Inserted 200000 elements in 0.044 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.019 secs

SkipSet
-- Inserted 200000 elements in 0.027 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.016 secs

... and now our SkipSet outperforms std::set in all areas, though just for this one kind of exceptional case.

Conclusion

So I don't know exactly what this says about skip lists. With hardly any time and effort, we got pretty close to rivaling std::set*. Yet we didn't beat it, and we had to cheat with a memory allocator to get really close.

* Note that mileage may vary across machines, vendor implementations of std::set, etc.

Times have changed quite a bit since the paper Pugh wrote in 1989.

Today the benefits of locality of reference dominate to the point where even a linearithmic algorithm can outperform a linear one, provided the former is considerably more cache- or page-friendly. Paying close attention to the way the system grabs chunks of memory from the upper levels of the memory hierarchy (ex: secondary storage), with slower but bigger memory, down to the little L1 cache line and the teeny register, is a bigger deal than ever before, and no longer "micro" if you ask me, when the benefits can rival algorithmic improvements.

The skip list is potentially crippled here given the considerably larger size of nodes, and just as importantly, the variable size of nodes (which makes them difficult to allocate very efficiently).

One thing we didn't look at is the localized nature in which a skip list updates on insertion. This tends to impact much fewer areas than a balancing tree requires to rebalance by rotating parent nodes. As a result, a skip list can offer a potentially more efficient implementation of a concurrent set or map.

Another promising characteristic of a skip list is that it is simply a linked list at the lowest level. As a result, it offers very simple linear-time sequential traversal without having to descend and climb the branches of a tree, so it is potentially very good if we want to do a lot of linear-time iterations through all the elements contained.
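That bottom-level walk can be sketched in isolation (a minimal node with only the fields needed for the illustration; `MiniNode` and `sum_all` are my names, not from the implementation above):

```cpp
#include <cassert>

// At the lowest level a skip list is just a sorted singly-linked
// list, so visiting all n elements is a plain O(n) pointer walk
// through the level-0 links.
struct MiniNode
{
    int value;
    MiniNode* next0; // stands in for node->next[0]
};

static int sum_all(const MiniNode* head)
{
    int sum = 0;
    // head is a sentinel (like the skip list's head node); real
    // elements start at head->next0.
    for (const MiniNode* n = head->next0; n; n = n->next0)
        sum += n->value;
    return sum;
}
```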

But always remember:

A technique is only as good as it can be applied in the hands of the implementor.
