跳过列表,他们真的表现好比Pugh纸索赔吗? [英] Skip lists, are they really performing as good as Pugh paper claim?

查看:284
本文介绍了跳过列表,他们真的表现好比Pugh纸索赔吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图实现一个跳过列表,执行与BST使用最小额外的内存开销一样好,目前,即使不考虑任何内存限制,我的跳过列表实现的性能是远离甚至非常朴素的平衡BST实现 - 也就是说,手工制作的BTS :) -



作为参考,我使用的原始论文从William Pugh





最后但并非最不重要的是,内存使用情况图,您可能会期望确认,如果我们增加每个节点的级别数量,我们会显着影响内存指纹。



$ b所有这一切,你们中的任何人可以提供一个评论如何实现一个跳过列表性能一样好 - 不计数额外的内存开销 - ?



我知道DrDobbs的文章 LINK 关于SkipList,并且我通过所有的论文,搜索和插入操作的代码完全匹配从 PUG89 所以应该和我一样好,而且文章并没有提供任何性能分析,我没有找到任何其他来源。你能帮助我吗?



感谢您的任何评论!



:)

解决方案

历史



Pugh写了他的原始文件。我们在他的论文中没有提到关于CPU和操作系统的存储器层次结构,这已经成为当今流行的焦点(现在经常与算法复杂度同样重要)。



他的基准测试的输入案例有2 ^ 16个元素,而硬件往往最多有32位扩展内存寻址。这使得指针的大小比我们今天在64位机器上使用的大小小一半或更小。同时,例如,字符串字段可以同样大,使得存储在跳过列表中的元素与跳过节点所需的指针之间的比率潜在地更小,特别是考虑到我们经常需要每个跳过节点的指针数量。



C编译器在对寄存器分配和指令选择等事情进行优化时没有那么激进。甚至平均的手写装配往往可以提供一个显着的性能优势。编译器提示像 register inline 实际上在那些时候做了一个很大的事情。虽然这可能看起来有点笨拙,因为平衡的BST和跳过列表实现将在这里平等,甚至基本循环的优化是一个更手动的过程。当优化是越来越多的手动过程时,更容易实现的东西通常更容易优化。跳过列表通常被认为比平衡树更容易实现。



所有这些因素都可能在Pugh的结论中有一部分。但是时代已经改变:硬件已经改变,操作系统已经改变,编译器已经改变,已经对这些主题进行了更多的研究。



实现



有了这一点,让我们有一些乐趣,并实现一个基本的跳过列表。我最后调整了这里懒惰的实施。这是一个运行的实现,几乎不同于今天很多容易访问的示例性跳过列表实现。



我们将比较我们的实现对 std :: set 的执行,它几乎总是实现为一个红黑树*。



*有些人可能会想知道为什么我使用 0 而不是 nullptr 这是一种习惯。在我的工作场所,我们仍然需要编写开放的库,目标范围广泛的编译器,包括只支持C ++ 03的编译器,所以我仍然习惯于编写低/中级实现代码,有时甚至在C,所以请原谅我写这段代码的旧风格。

  #include< iostream> 
#include< algorithm>
#include< cstdlib>
#include< ctime>
#include< cmath>
#include< vector>
#include< cassert>
#include< cstring>
#include< set>

using namespace std;

static const int max_level = 32;
static const float probability = 0.5;

static double sys_time()
{
return static_cast< double>(clock())/ CLOCKS_PER_SEC;
}

static int random_level()
{
int lvl = 1;
while((static_cast< float>(rand())/ RAND_MAX)< probability&& lvl< max_level)
++ lvl;
return lvl;
}

template< class T>
class SkipSet
{
public:
SkipSet():head(0)
{
head = create_node(max_level,T
level = 0;
}

〜SkipSet()
{
while(head)
{
Node * to_destroy =
head = head-> next [0];
destroy_node(to_destroy);
}
}

bool contains(const T& value)const
{
const Node * node = head;
for(int i = level; i> = 0; --i)
{
while(node-> next [i]&& node-& i] - > value< value)
node = node-> next [i]
}
node = node-> next [0];
return node&&& node-> value == value;
}

void insert(const T& value)
{
Node * node = head;
Node * update [max_level + 1];
memset(update,0,sizeof(Node *)*(max_level +1));

for(int i = level; i> = 0; i--)
{
while(node-> next [i]&& > next [i] - > value< value)
node = node-> next [i]
update [i] = node;
}
node = node-> next [0];

if(!node || node-> value!= value)
{
int lvl = random_level
assert(lvl> = 0);
if(lvl> level)
{
for(int i = level + 1; i <= lvl; i ++){
update [i]
}
level = lvl;
}
node = create_node(lvl,value);
for(int i = 0; i <= lvl; i ++){
node-> next [i] = update [i] - > next [i]
update [i] - > next [i] = node;
}
}
}

bool erase(const T& value)
{
Node * node = head;
Node * update [max_level + 1];
memset(update,0,sizeof(Node *)*(max_level + 1));

for(int i = level; i> = 0; i--)
{
while(node-> next [i]&& > next [i] - > value< value)
node = node-> next [i]
update [i] = node;
}
node = node-> next [0];

if(node-> value == value)
{
for(int i = 0; i <= level; i ++){
if update [i] - > next [i]!= node)
break;
update [i] - > next [i] = node-> next [i]
}
destroy_node(node);
while(level> 0&&!head-> next [level])
--level;
return true;
}
return false;
}

private:
struct Node
{
T value;
struct Node ** next;
};

Node * create_node(int level,const T& new_value)
{
void * node_mem = malloc(sizeof(Node));
Node * new_node = static_cast< Node *>(node_mem);
new(& new_node-> value)T(new_value);

void * next_mem = calloc(level + 1,sizeof(Node *));
new_node-> next = static_cast< Node **>(next_mem);
return new_node;
}

void destroy_node(Node * node)
{
node-> value。〜T();
free(node-> next);
free(node);
}

节点*头;
int level;
};

template< class T>
bool contains(const std :: set< T>& cont,const T& val)
{
return cont.find(val)!= cont.end();
}

template< class T>
bool contains(const SkipSet< T& cont,const T& val)
{
return cont.contains
}

template< class Set,class T>
void benchmark(int num,const T * elements,const T * search_elements)
{
const double start_insert = sys_time();
设置element_set;
for(int j = 0; j element_set.insert(elements [j]);
cout<< - Inserted< num<< 元素在< (sys_time() - start_insert)<< secs< endl;

const double start_search = sys_time();
int num_found = 0;
for(int j = 0; j {
if(contains(element_set,search_elements [j]))
++ num_found;
}
cout<< - Found< num_found< 元素在< (sys_time() - start_search)<< secs< endl;

const double start_erase = sys_time();
int num_erased = 0;
for(int j = 0; j {
if(element_set.erase(search_elements [j]))
++ num_erased;
}
cout<< - Erased< num_found< 元素在< (sys_time() - start_erase)<< secs< endl;
}

int main()
{
const int num_elements = 200000;
vector< int>元素(num_elements);
for(int j = 0; j elements [j] = j;
random_shuffle(elements.begin(),elements.end());

vector< int> search_elements = elements;
random_shuffle(search_elements.begin(),search_elements.end());

typedef std :: set< int> Set1;
typedef SkipSet< int> Set2;

for(int j = 0; j< 3; ++ j)
{
cout< std :: set< endl;
benchmark< Set1>(num_elements,& elements [0],& search_elements [0]);
cout<< endl;

cout<< SkipSet<< endl;
benchmark< Set2>(num_elements,& elements [0],& search_elements [0]);
cout<< endl;
}
}

在GCC 5.2,-O2上,

  std :: set 
- 在0.104869秒内插入200000个元素
- 找到200000个元素0.078351秒
- 在0.098208秒中擦除了200000个元素

SkipSet
- 在0.188765秒内插入200000个元素
- 在0.160895秒中找到200000个元素
- 在0.162444秒中擦除了200000个元素

...这是非常可怕的。



优化



但是有一个明显的优化可以使。如果我们看一下 Node ,它的当前字段如下所示:

  struct Node 
{
T value;
struct Node ** next;
};

这意味着Node字段的内存及其下一个指针列表是两个单独的块,可能它们之间的距离非常远:

  [节点字段] ------------ --------> [next0,next1,...,null] 

这对于参考的局部性不好。如果我们想改进这里的事情,我们应该将这些内存块合并成一个连续的结构,如下:

  ,next0,next1,...,null] 



我们可以通过变长结构在C ++中实现它有点尴尬,它不能直接支持它,但我们可以模拟效果如下:

  struct Node 
{
T value;
struct Node * next [1];
};

Node * create_node(int level,const T& new_value)
{
void * node_mem = malloc(sizeof(Node)+ level * sizeof(Node *));
Node * new_node = static_cast< Node *>(node_mem);
new(& new_node-> value)T(new_value);
for(int j = 0; j< level + 1; ++ j)
new_node-> next [j] = 0;
return new_node;
}

void destroy_node(Node * node)
{
node-> value。〜T();
free(node);
}

有了这个适度的调整,我们现在有这些时间:

  SkipSet(之前)
- 在0.188765秒内插入200000个元素
- 在0.160895秒中找到200000个元素
- 在0.162444秒中擦除200000个元素

SkipSet(之后)
- 在0.132322秒内插入200000个元素
- 在0.127989秒中找到200000个元素
- - 在0.130889秒内擦除200000个元素

...这使我们更接近竞争对手的性能 std :: set



随机数生成器



一个真正高效的跳过列表实现通常需要一个非常快的RNG。然而,我发现在一个快速剖析会话,只有一个非常少部分的时间花费在生成一个随机的水平/高度,几乎不足以认为它是一个热点。除非提供更均匀的分布,否则它也只会影响插入时间,因此我决定跳过此优化。



内存分配器



在这一点上,我们会对跳过列表实现和BST的期望有一个非常合理的概述:

 插入
- std :: set:0.104869 secs
- SkipList:0.132322 secs

搜索:
- - std :: set:0.078351 secs
- SkipList:0.127989 secs

删除:
- std :: set:0.098208 secs
- SkipList:0.130889 secs

然而,如果我们想要更进一步,我们可以利用一个固定的分配器。在这一点上,我们作弊一点,因为 std :: set 被设计用于符合标准分配器的接口要求的任何通用分配器。但是让我们来看看使用固定分配器:

  #include< iostream> 
#include< iomanip>
#include< algorithm>
#include< cstdlib>
#include< ctime>
#include< cmath>
#include< vector>
#include< cassert>
#include< set>

using namespace std;

static const int max_level = 32;

class FixedAlloc
{
public:
FixedAlloc():root_block(0),free_element(0),type_size(0),block_size (0),free_element(0),type_size(0),block_size(0),block_size(0),block_size(0) block_num(0)
{
init(itype_size,iblock_size);
}

〜FixedAlloc()
{
purge();
}

void init(int new_type_size,int new_block_size)
{
purge();
block_size = max(new_block_size,type_size);
type_size = max(new_type_size,static_cast< int>(sizeof(FreeElement)));
block_num = block_size / type_size;
}

void purge()
{
while(root_block)
{
Block * block = root_block;
root_block = root_block-> next;
free(block);
}
free_element = 0;
}

void * allocate()
{
assert(type_size> 0);
if(free_element)
{
void * mem = free_element;
free_element = free_element-> next_element;
return mem;
}

//创建新块。
void * new_block_mem = malloc(sizeof(Block) - 1 + type_size * block_num);
Block * new_block = static_cast< Block *>(new_block_mem);
new_block-> next = root_block;
root_block = new_block;

//将除了一个新块的所有元素之外的所有元素推送到空闲池。
char * mem = new_block-> mem;
for(int j = 1; j {
FreeElement * element = reinterpret_cast< FreeElement *>(mem + j * type_size);
element-> next_element = free_element;
free_element = element;
}
return mem;
}

void deallocate(void * mem)
{
FreeElement * element = static_cast< FreeElement *>
element-> next_element = free_element;
free_element = element;
}

void swap(FixedAlloc& other)
{
std :: swap(free_element,other.free_element);
std :: swap(root_block,other.root_block);
std :: swap(type_size,other.type_size);
std :: swap(block_size,other.block_size);
std :: swap(block_num,other.block_num);
}

private:
struct Block
{
Block * next;
char mem [1];
};
struct FreeElement
{
struct FreeElement * next_element;
};

//禁止复制。
FixedAlloc(const FixedAlloc&);
FixedAlloc& operator =(const FixedAlloc&);

struct Block * root_block;
struct FreeElement * free_element;
int type_size;
int block_size;
int block_num;
};

static double sys_time()
{
return static_cast< double>(clock())/ CLOCKS_PER_SEC;
}

static int random_level()
{
int lvl = 1;
while(rand()%2 == 0&& lvl< max_level)
++ lvl;
return lvl;
}

template< class T>
class SkipSet
{
public:
SkipSet():head(0)
{
for(int j = 0; j head = create_node(max_level,T());
level = 0;
}

〜SkipSet()
{
while(head)
{
Node * to_destroy =
head = head-> next [0];
destroy_node(to_destroy);
}
}

bool contains(const T& value)const
{
const Node * node = head;
for(int i = level; i> = 0; --i)
{
while(node-> next [i]&& node-& i] - > value< value)
node = node-> next [i]
}
node = node-> next [0];
return node&&& node-> value == value;
}

void insert(const T& value)
{
Node * node = head;
Node * update [max_level + 1] = {0};
for(int i = level; i> = 0; --i)
{
while(node-> next [i]&& node-& i] - > value< value)
node = node-> next [i]
update [i] = node;
}
node = node-> next [0];

if(!node || node-> value!= value)
{
const int lvl = random_level
assert(lvl> = 0);
if(lvl> level)
{
for(int i = level + 1; i <= lvl; ++ i)
update [i]
level = lvl;
}
node = create_node(lvl,value);
for(int i = 0; i <= lvl; ++ i)
{
node-> next [i] = update [i] ;
update [i] - > next [i] = node;
}
}
}

bool erase(const T& value)
{
Node * node = head;
Node * update [max_level + 1] = {0};
for(int i = level; i> = 0; --i)
{
while(node-> next [i]&& node-& i] - > value< value)
node = node-> next [i]
update [i] = node;
}
node = node-> next [0];

if(node-> value == value)
{
for(int i = 0; i <= level; ++ i){
if(update [i] - > next [i]!= node)
break;
update [i] - > next [i] = node-> next [i]
}
destroy_node(node);
while(level> 0&&!head-> next [level])
--level;
return true;
}
return false;
}

void swap(SkipSet< T>& other)
{
for(int j = 0; j< max_level; ++ j)
allocs [j] .swap(other.allocs [j]);
std :: swap(head,other.head);
std :: swap(level,other.level);
}

private:
struct Node
{
T value;
int num;
struct Node * next [1];
};

Node * create_node(int level,const T& new_value)
{
void * node_mem = allocs [level-1] .allocate();
Node * new_node = static_cast< Node *>(node_mem);
new(& new_node-> value)T(new_value);
new_node-> num = level;
for(int j = 0; j< level + 1; ++ j)
new_node-> next [j] = 0;
return new_node;
}

void destroy_node(Node * node)
{
node-> value。〜T();
allocs [node-> num-1] .deallocate(node);
}

FixedAlloc allocs [max_level];
Node * head;
int level;
};

template< class T>
bool contains(const std :: set< T>& cont,const T& val)
{
return cont.find(val)!= cont.end();
}

template< class T>
bool contains(const SkipSet< T& cont,const T& val)
{
return cont.contains
}

template< class Set,class T>
void benchmark(int num,const T * elements,const T * search_elements)
{
const double start_insert = sys_time();
设置element_set;
for(int j = 0; j element_set.insert(elements [j]);
cout<< - Inserted< num<< 元素在< (sys_time() - start_insert)<< secs< endl;

const double start_search = sys_time();
int num_found = 0;
for(int j = 0; j {
if(contains(element_set,search_elements [j]))
++ num_found;
}
cout<< - Found< num_found< 元素在< (sys_time() - start_search)<< secs< endl;

const double start_erase = sys_time();
int num_erased = 0;
for(int j = 0; j {
if(element_set.erase(search_elements [j]))
++ num_erased;
}
cout<< - Erased< num_found< 元素在< (sys_time() - start_erase)<< secs< endl;
}

int main()
{
const int num_elements = 200000;
vector< int>元素(num_elements);
for(int j = 0; j elements [j] = j;
random_shuffle(elements.begin(),elements.end());

vector< int> search_elements = elements;
random_shuffle(search_elements.begin(),search_elements.end());

typedef std :: set< int> Set1;
typedef SkipSet< int> Set2;

cout<<固定< setprecision(3);
for(int j = 0; j <2; ++ j)
{
cout< std :: set< endl;
benchmark< Set1>(num_elements,& elements [0],& search_elements [0]);
cout<< endl;

cout<< SkipSet<< endl;
benchmark< Set2>(num_elements,& elements [0],& search_elements [0]);
cout<< endl;
}
}

*我也做了一个小小的调整 random_level ,以使其在发现这似乎工作得很好后简单地假设概率为50%。



通过使用固定分配器,我们可以使用非常简单的恒定时间策略快速分配和释放元素,更重要的是,以一种方式分配节点,使得具有相同高度(最常访问在一起)的节点被分配相对于彼此更连续(尽管不是以理想的顺序次序)。这将时间改进为:

 插入
- std :: set:0.104869 secs
- SkipList:0.103632 secs

搜索:
- std :: set:0.078351 secs
- SkipList:0.089910 secs

删除:
-std :: set:0.098208 secs
- SkipList:0.089224 secs

。 ..这是不坏的大约40分钟的工作对 std :: set 已被戳,并由行业内的专家调整和调整。我们还有比 std :: set 更快的删除(插入时间在我的机器上有点波动,我想说他们大致在par)。



当然,我们欺骗了最后一步。 Using a custom allocator tilts things in our favor, and also changes the characteristics of the container in a way such that its memory cannot be freed until it is cleared, destroyed, or compacted. It can mark the memory as unused and reclaim it upon subsequent insertions, but it cannot make the memory it uses available for those outside the skip list to use. I also didn’t bother to pay attention to proper alignment for all possible types of T which would be necessary to truly generalize it.



Sorted Input



Let’s try using this against sorted input. To do so, simply comment out the two random_shuffle statements. On my end, I now get these times:

std::set 
-- Inserted 200000 elements in 0.044 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.019 secs

SkipSet
-- Inserted 200000 elements in 0.027 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.016 secs

... and now our SkipSet outperforms std::set in all areas, though just for this one kind of exceptional case.



Conclusion



So I don’t know exactly what this says about skip lists. With hardly any time and effort, we got pretty close to rivaling std::set*. Yet we didn’t beat it, and we had to cheat with a memory allocator to get really close.



* Note that mileage may vary across machines, vendor implementations of std::set, etc.



Times have changed quite a bit since the paper Pugh wrote in 1989.



Today the benefits of locality of reference dominates things to a point where even a linearithmic complexity algorithm can outperform a linear one provided that the former is considerably more cache or page-friendly. Paying close attention to the way the system grabs chunks of memory from upper levels of the memory hierarchy (ex: secondary stage) with slower but bigger memory and down to the little L1 cache line and teeny register is a bigger deal than ever before, and no longer \"micro\" if you ask me when the benefits can rival algorithmic improvements.



The skip list is potentially crippled here given the considerably larger size of nodes, and just as importantly, the variable size of nodes (which makes them difficult to allocate very efficiently).



One thing we didn’t look at is the localized nature in which a skip list updates on insertion. This tends to impact much fewer areas than a balancing tree requires to rebalance by rotating parent nodes. As a result, a skip list can offer a potentially more efficient implementation of a concurrent set or map.



Another promising characteristic of a skip list is that it is simply a linked-list a the lowest level. As a result, it offers very simple linear-time sequential traversal without having to descend down the branches of the tree with linearithmic complexity, so it is potentially very good if we want to do a lot of linear-time iterations through all the elements contained.



But always remember:



A technique is only as good as it can be applied in the hands of the implementor.


I'm trying to implement a skip list that perform as good as a BST using a minimum additional memory overhead, at the moment even not considering any memory restriction the performance of my SkipList implementation is far away from even a very naive Balanced BST implementation -so to say, handcrafted BTS :) -

As reference I'm using the original paper from William Pugh PUG89 and the implementation i found in Algorithms in C from Sedgewick -13.5-. My code is a recursive implementation, here's the clue of the insert and find operation:

sl_node* create_node()
{
    short lvl {1};
    while((dist2(en)<p)&&(lvl<max_level))
        ++lvl;
    return new sl_node(lvl);
}

void insert_impl(sl_node* cur_node,
        sl_node* new_node,
        short lvl)
{
    if(cur_node->next_node[lvl]==nullptr || cur_node->next_node[lvl]->value > new_node->value){
        if(lvl<new_node->lvl){
            new_node->next_node[lvl] = cur_node->next_node[lvl];
            cur_node->next_node[lvl] = new_node;
        }
        if(lvl==0) return;
        insert_impl(cur_node,new_node,lvl-1);
        return;
    }
    insert_impl(cur_node->next_node[lvl],new_node,lvl);
}
sl_node* insert(long p_val)
{
    sl_node* new_node = create_node();
    new_node->value = p_val;
    insert_impl(head, new_node,max_level-1);
    return new_node;
}

And this is the code for the find operation:

sl_node* find_impl(sl_node* cur_node,
        long p_val,
        int lvl)
{
    if(cur_node==nullptr) return nullptr;
    if(cur_node->value==p_val) return cur_node;
    if(cur_node->next_node[lvl] == nullptr || cur_node->next_node[lvl]->value>p_val){
        if(lvl==0) return nullptr;
        return find_impl(cur_node,p_val,lvl-1);
    }
    return find_impl(cur_node->next_node[lvl],p_val,lvl);
}

sl_node* find(long p_val)
{
    return find_impl(head,p_val,max_level-1);
}

A sl_node -skip list node- looks like this:

struct sl_node
{
    long  value;
    short lvl;
    sl_node** next_node;

    sl_node(int l) : lvl(l)
    {
        next_node = new sl_node*[l];
        for(short i{0};i<l;i++)
            next_node[i]=nullptr;
    }
    ~sl_node()
    {
        delete[] next_node;
    }
};

As you can see this is implementation has nothing special nor advanced if compared to the book implementation, i will not share the Balaced BTS code since I don't think is needed here, but is a very basic BTS with a rebalance function triggered during the insertion when the new node height is greather than 16*lg(n) where n is the number of nodes.

So to say, i rebalance the three only if the maximum height is 16 times greather than the best theoretical one, as i said, this straighforward homemade BST performs much better than the homemade Skip List.

But first, let's have a look at some data, using p=.5 and n=262144 the level of the nodes in the SkipList has the following distribution:

1:141439, 53.9547%
2:65153, 24.8539%
3:30119, 11.4895%
4:13703, 5.22728%
5:6363, 2.42729%
6:2895, 1.10435%
7:1374, 0.524139%
8:581, 0.221634%
9:283, 0.107956%
10:117, 0.044632%
11:64, 0.0244141%
12:31, 0.0118256%
13:11, 0.00419617%
14:5, 0.00190735%
15:1, 0.00038147%
16:5, 0.00190735%

Which almost perfectly -oh, this is big one!- match the theory from the article, that is: 50% level 1, 25% level 2 and so on and so forth. The input data came from my best available pseudo-random number generator aka std::random_device with std::default_random_engine and an uniform int distribution. The input looks pretty random to me :) !

The time required to search for 'all' the 262144 elements in the SkipList -in random order- is 315ms on my machine, whereas for the same search operation on the naive BTS the required time is 134ms, so the BTS is almost twice faster than the SkipList. That is not exactly what i was expecting from "Skip list algoriths have the same asymptotic expected time bounds as balanced trees and are simple, faster and use less space" PUG89.

The time required for the 'insertion' of the nodes is 387ms for the SkipList and 143ms for the BTS, again a naive BST perform better.

The things get little more interesting if instead of using a random sequence of input number i use a sorted sequence, here my poor homemade BST become slow, and the insertion of 262144 sorted int's takes 2866ms whereas the SkipList require only 168ms.

But, when come to the search time, the BST is still faster! for the sorted input we have 234ms versus 77ms, this BST is 3x faster.

With different values of the p-factor i got slightly different performance results:

And last but not least, the memory usage graph, which as you may expect confirm that if we increase the amount of levels per node we impact significantly the memory fingerprint. The memory usage is calculated just as a sum of the space required to store the additional pointers for all the node.

After all this, does anybody of you can provide me a comment on how to implement a SkipList performing as good as a BTS -not counting the additional memory overhead-?

I know about the article from DrDobbs LINK about SkipList, and i went thru all the paper, the code for the search and insert operation match exactly the original implementation from PUG89 so should be as good as mine, and the article does not provide any performance analysis anyway, I did not found any other source. Can you help me?

Any comment is highly appreciated!

Thanks! :)

解决方案

History

Times have changed a bit since William Pugh wrote his original paper. We see no mention in his paper about the memory hierarchy of the CPU and operating system which has become such a prevalent focus today (now often equally as important as algorithmic complexity).

His input case for benchmarking had a measly 2^16 elements, and hardware back then typically had, at most, 32-bit extended memory addressing available. This made the size of a pointer half the size or smaller than what we're used to today on 64-bit machines. Meanwhile a string field, e.g., could be just as large, making the ratio between the elements stored in the skip list and the pointers required by a skip node potentially a lot smaller, especially given that we often need a number of pointers per skip node.

C Compilers weren't so aggressive at optimization back then with respect to things like register allocation and instruction selection. Even average hand-written assembly could often provide a significant benefit in performance. Compiler hints like register and inline actually made a big deal during those times. While this might seem kind of moot since both a balanced BST and skip list implementation would be on equal footing here, optimization of even a basic loop was a more manual process. When optimization is an increasingly manual process, something that is easier to implement is often easier to optimize. Skip lists are often considered to be a lot easier to implement than a balancing tree.

So all of these factors probably had a part in Pugh's conclusions at the time. Yet times have changed: hardware has changed, operating systems have changed, compilers have changed, more research has been done into these topics, etc.

Implementation

With that aside, let's have some fun and implement a basic skip list. I ended up adapting the implementation available here out of laziness. It's a run-of-the-mill kind of implementation, hardly different from the plethora of easily-accessible exemplary skip list implementations out there today.

We'll be comparing the performance of our implementation against std::set which is almost always implemented as a red-black tree*.

* Some might wonder why I use 0 instead of nullptr and things of that sort. It's a habit. At my workplace, we still have to write open libraries that target a wide range of compilers including those that only support C++03, so I'm still used to writing lower/mid-level implementation code that way, and sometimes even in C, so please forgive the old style in which I wrote this code.

#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <vector>
#include <cassert>
#include <cstring>
#include <set>

using namespace std;

static const int max_level = 32;
static const float probability = 0.5;

static double sys_time()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

static int random_level() 
{
    int lvl = 1;
    while ((static_cast<float>(rand()) / RAND_MAX) < probability && lvl < max_level)
        ++lvl;
    return lvl;
}

template <class T>
class SkipSet
{
public:
    SkipSet(): head(0)
    {
        head = create_node(max_level, T());
        level = 0;
    }

    ~SkipSet()
    {
        while (head)
        {
            Node* to_destroy = head;
            head = head->next[0];
            destroy_node(to_destroy);
        }
    }

    bool contains(const T& value) const
    {
        const Node* node = head;
        for (int i=level; i >= 0; --i)
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
        }
        node = node->next[0];
        return node && node->value == value;
    }

    void insert(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1];
        memset(update, 0, sizeof(Node*)*(max_level + 1));

        for (int i = level; i >= 0; i--) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (!node || node->value != value)
        {
            int lvl = random_level();
            assert(lvl >= 0);
            if (lvl > level) 
            {
                for (int i = level + 1; i <= lvl; i++) {
                    update[i] = head;
                }
                level = lvl;
            }
            node = create_node(lvl, value);
            for (int i = 0; i <= lvl; i++) {
                node->next[i] = update[i]->next[i];
                update[i]->next[i] = node;
            }            
        }
    }

    bool erase(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1];
        memset(update, 0, sizeof(Node*)*(max_level + 1));

        for (int i = level; i >= 0; i--) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (node->value == value)
        {
            for (int i = 0; i <= level; i++) {
                if (update[i]->next[i] != node)
                    break;
                update[i]->next[i] = node->next[i];
            }
            destroy_node(node);
            while (level > 0 && !head->next[level])
                --level;
            return true;
        }
        return false;
    }

private:
    struct Node
    {
        T value;
        struct Node** next;
    };

    Node* create_node(int level, const T& new_value)
    {
        void* node_mem = malloc(sizeof(Node));
        Node* new_node = static_cast<Node*>(node_mem);
        new (&new_node->value) T(new_value);

        void* next_mem = calloc(level+1, sizeof(Node*));
        new_node->next = static_cast<Node**>(next_mem);
        return new_node;
    }

    void destroy_node(Node* node)
    {
        node->value.~T();
        free(node->next);
        free(node);
    }

    Node* head;
    int level;
};

template <class T>
bool contains(const std::set<T>& cont, const T& val)
{
    return cont.find(val) != cont.end();
}

template <class T>
bool contains(const SkipSet<T>& cont, const T& val)
{
    return cont.contains(val);
}

template <class Set, class T>
void benchmark(int num, const T* elements, const T* search_elements)
{
    const double start_insert = sys_time();
    Set element_set;
    for (int j=0; j < num; ++j)
        element_set.insert(elements[j]);
    cout << "-- Inserted " << num << " elements in " << (sys_time() - start_insert) << " secs" << endl;

    const double start_search = sys_time();
    int num_found = 0;
    for (int j=0; j < num; ++j)
    {
        if (contains(element_set, search_elements[j]))
            ++num_found;
    }
    cout << "-- Found " << num_found << " elements in " << (sys_time() - start_search) << " secs" << endl;

    const double start_erase = sys_time();
    int num_erased = 0;
    for (int j=0; j < num; ++j)
    {
        if (element_set.erase(search_elements[j]))
            ++num_erased;
    }
    cout << "-- Erased " << num_found << " elements in " << (sys_time() - start_erase) << " secs" << endl;
}

int main()
{
    const int num_elements = 200000;
    vector<int> elements(num_elements);
    for (int j=0; j < num_elements; ++j)
        elements[j] = j;
    random_shuffle(elements.begin(), elements.end());

    vector<int> search_elements = elements;
    random_shuffle(search_elements.begin(), search_elements.end());

    typedef std::set<int> Set1;
    typedef SkipSet<int> Set2;

    for (int j=0; j < 3; ++j)
    {
        cout << "std::set" << endl;
        benchmark<Set1>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;

        cout << "SkipSet" << endl;
        benchmark<Set2>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;
    }
}

On GCC 5.2, -O2, I get this:

std::set
-- Inserted 200000 elements in 0.104869 secs
-- Found 200000 elements in 0.078351 secs
-- Erased 200000 elements in 0.098208 secs

SkipSet
-- Inserted 200000 elements in 0.188765 secs
-- Found 200000 elements in 0.160895 secs
-- Erased 200000 elements in 0.162444 secs

... which is pretty awful. We're around twice as slow all across the board.

Optimization

Yet there is a glaring optimization we can make. If we look at Node, its current fields look like this:

struct Node
{
    T value;
    struct Node** next;
};

This implies that the memory for the Node fields and its list of next pointers are two separate blocks, possibly with a very distant stride between them like so:

    [Node fields]-------------------->[next0,next1,...,null]

This does poorly for locality of reference. If we want to improve things here, we should merge these memory blocks into a single contiguous structure, like so:

    [Node fields,next0,next1,...,null]

We can achieve this through the variable-length struct idiom common in C. It's a little bit awkward to implement in C++ which doesn't support it so directly, but we can emulate the effect like so:

struct Node
{
    T value;
    struct Node* next[1];
};

Node* create_node(int level, const T& new_value)
{
    void* node_mem = malloc(sizeof(Node) + level * sizeof(Node*));
    Node* new_node = static_cast<Node*>(node_mem);
    new (&new_node->value) T(new_value);
    for (int j=0; j < level+1; ++j)
        new_node->next[j] = 0;
    return new_node;
}

void destroy_node(Node* node)
{
    node->value.~T();
    free(node);
}

With this modest tweak, we now have these times:

SkipSet (Before)
-- Inserted 200000 elements in 0.188765 secs
-- Found 200000 elements in 0.160895 secs
-- Erased 200000 elements in 0.162444 secs

SkipSet (After)
-- Inserted 200000 elements in 0.132322 secs
-- Found 200000 elements in 0.127989 secs
-- Erased 200000 elements in 0.130889 secs

... which gets us considerably closer to rivaling the performance of std::set.

Random Number Generator

A truly efficient skip list implementation will generally want a very fast RNG. Nevertheless, I found during a quick profiling session that only a very teeny portion of time is spent generating a random level/height, hardly enough to consider it much of a hotspot. It would also only impact the insertion times unless it provided a more uniform distribution, so I've decided to skip this optimization.

Memory Allocator

At this point, I'd say we have a pretty reasonable overview of what we can expect with a skip list implementation vs. a BST:

Insertion
-- std::set: 0.104869 secs
-- SkipList: 0.132322 secs

Search:
-- std::set: 0.078351 secs
-- SkipList: 0.127989 secs

Removal:
-- std::set: 0.098208 secs
-- SkipList: 0.130889 secs

However, if we want to soldier on a little further, we can utilize a fixed allocator. At this point, we're cheating a little bit as std::set is designed to work with any general-purpose allocator which conforms to the interface requirements of a standard allocator. But let's have a look at using a fixed allocator:

#include <iostream>
#include <iomanip>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <vector>
#include <cassert>
#include <set>

using namespace std;

static const int max_level = 32;

class FixedAlloc
{
public:
    FixedAlloc(): root_block(0), free_element(0), type_size(0), block_size(0), block_num(0)
    {
    }

    FixedAlloc(int itype_size, int iblock_size): root_block(0), free_element(0), type_size(0), block_size(0), block_num(0)
    {
        init(itype_size, iblock_size);
    }

    ~FixedAlloc()
    {
        purge();
    }

    void init(int new_type_size, int new_block_size)
    {
        purge();
        block_size = max(new_block_size, type_size);
        type_size = max(new_type_size, static_cast<int>(sizeof(FreeElement)));
        block_num = block_size / type_size;
    }

    void purge()
    {
        while (root_block)
        {
            Block* block = root_block;
            root_block = root_block->next;
            free(block);
        }
        free_element = 0;
    }

    void* allocate()
    {
        assert(type_size > 0);
        if (free_element)
        {
            void* mem = free_element;
            free_element = free_element->next_element;
            return mem;
        }

        // Create new block.
        void* new_block_mem = malloc(sizeof(Block) - 1 + type_size * block_num);
        Block* new_block = static_cast<Block*>(new_block_mem);
        new_block->next = root_block;
        root_block = new_block;

        // Push all but one of the new block's elements to the free pool.
        char* mem = new_block->mem;
        for (int j=1; j < block_num; ++j)
        {
            FreeElement* element = reinterpret_cast<FreeElement*>(mem + j * type_size);
            element->next_element = free_element;
            free_element = element;
        }
        return mem;
    }

    void deallocate(void* mem)
    {
        FreeElement* element = static_cast<FreeElement*>(mem);
        element->next_element = free_element;
        free_element = element;
    }

    void swap(FixedAlloc& other)
    {
        std::swap(free_element, other.free_element);
        std::swap(root_block, other.root_block);
        std::swap(type_size, other.type_size);
        std::swap(block_size, other.block_size);
        std::swap(block_num, other.block_num);
    }

private:
    struct Block
    {
        Block* next;
        char mem[1];
    };
    struct FreeElement
    {
        struct FreeElement* next_element;
    };

    // Disable copying.
    FixedAlloc(const FixedAlloc&);
    FixedAlloc& operator=(const FixedAlloc&);

    struct Block* root_block;
    struct FreeElement* free_element;
    int type_size;
    int block_size;
    int block_num;
};

static double sys_time()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

static int random_level()
{
    int lvl = 1;
    while (rand()%2 == 0 && lvl < max_level)
        ++lvl;
    return lvl;
}

template <class T>
class SkipSet
{
public:
    SkipSet(): head(0)
    {
        for (int j=0; j < max_level; ++j)
            allocs[j].init(sizeof(Node) + (j+1)*sizeof(Node*), 4096);
        head = create_node(max_level, T());
        level = 0;
    }

    ~SkipSet()
    {
        while (head)
        {
            Node* to_destroy = head;
            head = head->next[0];
            destroy_node(to_destroy);
        }
    }

    bool contains(const T& value) const
    {
        const Node* node = head;
        for (int i=level; i >= 0; --i) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
        }
        node = node->next[0];
        return node && node->value == value;
    }

    void insert(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1] = {0};
        for (int i=level; i >= 0; --i) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (!node || node->value != value)
        {
            const int lvl = random_level();
            assert(lvl >= 0);
            if (lvl > level) 
            {
                for (int i = level + 1; i <= lvl; ++i)
                    update[i] = head;
                level = lvl;
            }
            node = create_node(lvl, value);
            for (int i = 0; i <= lvl; ++i) 
            {
                node->next[i] = update[i]->next[i];
                update[i]->next[i] = node;
            }            
        }
    }

    bool erase(const T& value)
    {
        Node* node = head;  
        Node* update[max_level + 1] = {0};
        for (int i=level; i >= 0; --i) 
        {
            while (node->next[i] && node->next[i]->value < value)
                node = node->next[i];
            update[i] = node; 
        }
        node = node->next[0];

        if (node->value == value)
        {
            for (int i=0; i <= level; ++i) {
                if (update[i]->next[i] != node)
                    break;
                update[i]->next[i] = node->next[i];
            }
            destroy_node(node);
            while (level > 0 && !head->next[level])
                --level;
            return true;
        }
        return false;
    }

    void swap(SkipSet<T>& other)
    {
        for (int j=0; j < max_level; ++j)
            allocs[j].swap(other.allocs[j]);
        std::swap(head, other.head);
        std::swap(level, other.level);
    }

private:
    struct Node
    {
        T value;
        int num;
        struct Node* next[1];
    };

    Node* create_node(int level, const T& new_value)
    {
        void* node_mem = allocs[level-1].allocate();
        Node* new_node = static_cast<Node*>(node_mem);
        new (&new_node->value) T(new_value);
        new_node->num = level;
        for (int j=0; j < level+1; ++j)
            new_node->next[j] = 0;
        return new_node;
    }

    void destroy_node(Node* node)
    {
        node->value.~T();
        allocs[node->num-1].deallocate(node);
    }

    FixedAlloc allocs[max_level];
    Node* head;
    int level;
};

template <class T>
bool contains(const std::set<T>& cont, const T& val)
{
    return cont.find(val) != cont.end();
}

template <class T>
bool contains(const SkipSet<T>& cont, const T& val)
{
    return cont.contains(val);
}

template <class Set, class T>
void benchmark(int num, const T* elements, const T* search_elements)
{
    const double start_insert = sys_time();
    Set element_set;
    for (int j=0; j < num; ++j)
        element_set.insert(elements[j]);
    cout << "-- Inserted " << num << " elements in " << (sys_time() - start_insert) << " secs" << endl;

    const double start_search = sys_time();
    int num_found = 0;
    for (int j=0; j < num; ++j)
    {
        if (contains(element_set, search_elements[j]))
            ++num_found;
    }
    cout << "-- Found " << num_found << " elements in " << (sys_time() - start_search) << " secs" << endl;

    const double start_erase = sys_time();
    int num_erased = 0;
    for (int j=0; j < num; ++j)
    {
        if (element_set.erase(search_elements[j]))
            ++num_erased;
    }
    cout << "-- Erased " << num_found << " elements in " << (sys_time() - start_erase) << " secs" << endl;
}

int main()
{
    const int num_elements = 200000;
    vector<int> elements(num_elements);
    for (int j=0; j < num_elements; ++j)
        elements[j] = j;
    random_shuffle(elements.begin(), elements.end());

    vector<int> search_elements = elements;
    random_shuffle(search_elements.begin(), search_elements.end());

    typedef std::set<int> Set1;
    typedef SkipSet<int> Set2;

    cout << fixed << setprecision(3);
    for (int j=0; j < 2; ++j)
    {
        cout << "std::set" << endl;
        benchmark<Set1>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;

        cout << "SkipSet" << endl;
        benchmark<Set2>(num_elements, &elements[0], &search_elements[0]);
        cout << endl;
    }
}

* I also made a minor tweak to random_level to make it simply assume a probability of 50% after discovering that this seems to work quite well.

By using a fixed allocator, we can quickly allocate and deallocate elements using a very simple constant-time strategy, and more importantly, allocate nodes in a way such that nodes with the same height (the most frequently accessed together) are allocated more contiguously with respect to each other (though not in an ideal sequential order). This improves the times to:

Insertion
-- std::set: 0.104869 secs
-- SkipList: 0.103632 secs

Search:
-- std::set: 0.078351 secs
-- SkipList: 0.089910 secs

Removal:
-- std::set: 0.098208 secs
-- SkipList: 0.089224 secs

... which isn't bad for about 40 minutes of work against std::set which has been poked and prodded and tuned by experts all over the industry. We also have faster removals than std::set (insertion times fluctuate a bit on my machine, I'd say they are roughly on par).

Of course we cheated to apply this last step. Using a custom allocator tilts things in our favor, and also changes the characteristics of the container in a way such that its memory cannot be freed until it is cleared, destroyed, or compacted. It can mark the memory as unused and reclaim it upon subsequent insertions, but it cannot make the memory it uses available for those outside the skip list to use. I also didn't bother to pay attention to proper alignment for all possible types of T which would be necessary to truly generalize it.

Sorted Input

Let's try using this against sorted input. To do so, simply comment out the two random_shuffle statements. On my end, I now get these times:

std::set
-- Inserted 200000 elements in 0.044 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.019 secs

SkipSet
-- Inserted 200000 elements in 0.027 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.016 secs

... and now our SkipSet outperforms std::set in all areas, though just for this one kind of exceptional case.

Conclusion

So I don't know exactly what this says about skip lists. With hardly any time and effort, we got pretty close to rivaling std::set*. Yet we didn't beat it, and we had to cheat with a memory allocator to get really close.

* Note that mileage may vary across machines, vendor implementations of std::set, etc.

Times have changed quite a bit since the paper Pugh wrote in 1989.

Today the benefits of locality of reference dominates things to a point where even a linearithmic complexity algorithm can outperform a linear one provided that the former is considerably more cache or page-friendly. Paying close attention to the way the system grabs chunks of memory from upper levels of the memory hierarchy (ex: secondary stage) with slower but bigger memory and down to the little L1 cache line and teeny register is a bigger deal than ever before, and no longer "micro" if you ask me when the benefits can rival algorithmic improvements.

The skip list is potentially crippled here given the considerably larger size of nodes, and just as importantly, the variable size of nodes (which makes them difficult to allocate very efficiently).

One thing we didn't look at is the localized nature in which a skip list updates on insertion. This tends to impact much fewer areas than a balancing tree requires to rebalance by rotating parent nodes. As a result, a skip list can offer a potentially more efficient implementation of a concurrent set or map.

Another promising characteristic of a skip list is that it is simply a linked-list a the lowest level. As a result, it offers very simple linear-time sequential traversal without having to descend down the branches of the tree with linearithmic complexity, so it is potentially very good if we want to do a lot of linear-time iterations through all the elements contained.

But always remember:

A technique is only as good as it can be applied in the hands of the implementor.

这篇关于跳过列表,他们真的表现好比Pugh纸索赔吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆