使用Boost查找大图的连接组件 [英] Find connected components of big Graph using Boost

查看:120
本文介绍了使用Boost查找大图的连接组件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我写了一个代码,用于查找一个非常大的Graph(8000万条边)
的连通分量,但它不起作用,当边数接近4000万时,它会崩溃。

b
$ b

  int main(){
using namespace boost;
{
int node1,node2;
typedef adjacency_list< vecS,vecS,undirectedS>图形;
图形G;
std :: ifstream infile(pairs.txt);
std :: string line;
while(std :: getline(infile,line))
{
std :: istringstream iss(line);
iss>>节点1>>节点2;
add_edge(node1,node2,G);}
cout <<< lt;写入文件<< endl;
int j = 0;
流出;
out.open(connected_component.txt);
std :: vector< int>组分(为num_vertices(G));
int num = connected_components(G,& component [0]);
std :: vector< int> :: size_type i;
for(i = 0; i!= component.size(); ++ i){
out<<我<< \ t<<< component< i><< endl;}
out.close();
}

任何关于如何通过boost来做到这一点的想法?或更改我的图形数据类型?

解决方案

使用随机图形数据,我可以在37秒左右运行4000万条边(根据Massif调整到4.4GiB的内存)。


/ tmp $ od -Anone -w4 -t u2 -v / dev / urandom | head -n 40000000> pairs.txt

/ tmp $ 时间./测试




 阅读40000000完成5543ms 
Building在3425ms完成的图形
在8957ms完成的算法
写入文件
在52ms完成写入

真实0m37.339s
用户0m36.078s
sys 0m1.202s



1。内存分配



请注意,我通过使用矢量边界列表来调整它,以便我可以保留所需的容量:

  typedef adjacency_list< listS,vecS,undirectedS,no_property,no_property,
no_property,vecS>图形;




  • 通过删除重新分配来增强负载性能
  • 减少堆碎片


2。顶点id缩放



一个重要的注意事项是存储需求随着顶点的数量进行缩放。更具体地说,它们用顶点的域进行缩放。例如。加载一个像这样的文件:

  1 7 
2 7
5 6
4 9

的内存要求要低于

  1 70000 
2 70000
5 60000
4 90000
code>

实际上,使用完全相同的输入重新运行上述基准,但仅第一行更改from

  47662 60203 

转换为

  476624766 602036020 

结果如下:

 读取40000000在5485ms完成
tcmalloc:大ALLOC 14448869376个字节== 0x7c0f2000 @ 0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a(无)
栋图在6754ms
tcmalloc完成:大ALLOC 2408144896个字节== 0x49fe46000 @ 0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a(零)
tcmalloc:larg e alloc 2408144896 bytes == 0x52ffd0000 @ 0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a(nil)
算法在31644ms完成
写入文件
在75921ms写入完成

真正2m20。 318s
user 1m30.224s
sys 0m49.821s

正如您所见谷歌的malloc实现(来自 gperftools )甚至会对特别大的分配提出警告,事实上,它的运行速度要慢得多。 (哦,内存使用量变得很大,Massif不再使用了,但是我已经看到它在htop中达到了23GiB)。

完整代码

查看 直播Coliru 在4000条边上运行:

  #include< boost / graph / adjacency_list.hpp> 
#include< boost / graph / connected_components.hpp>
#include< fstream>
#include< iostream>

#include< chrono>

使用Clock = std :: chrono :: high_resolution_clock;

int main()
{
using namespace boost;
typedef adjacency_list< listS,vecS,undirectedS,no_property,no_property,no_property,vecS>图形;
图形G;

//读取边缘
auto start = Clock :: now();
std :: ifstream infile(pairs.txt,std :: ios :: binary);

std :: vector< std :: pair< int,int> > as_read;

int node1,node2; (infile>> node1>> node2)
as_read.emplace_back(node1,node2);

std :: cout<< 阅读<< as_read.size()<< 完成<< std :: chrono :: duration_cast< std :: chrono :: milliseconds>(Clock :: now() - start).count()<< ms\\\
;
start = Clock :: now();

//构建图
G.m_edges.reserve(as_read.size()); (auto& pair:as_read)
add_edge(pair.first,pair.second,G);

std :: cout<< 在...中完成构建图<< std :: chrono :: duration_cast< std :: chrono :: milliseconds>(Clock :: now() - start).count()<< ms\\\
;
start = Clock :: now();

//找到连接的组件
std :: vector< int>组分(为num_vertices(G));
int num = connected_components(G,& component [0]);

std :: cout<< 算法完成<< std :: chrono :: duration_cast< std :: chrono :: milliseconds>(Clock :: now() - start).count()<< ms\\\
;
start = Clock :: now();

//写输出
std :: cout<<写文件<< std :: endl;

std :: ofstream out;
out.open(connected_component.txt);
for(size_t i = 0; i!= component.size(); ++ i){
out<<我<< \t<<分量[i] <<的std :: ENDL;
}

out.close();
std :: cout<< 写入完成<< std :: chrono :: duration_cast< std :: chrono :: milliseconds>(Clock :: now() - start).count()<< ms\\\
;
}


I have wrote a code for finding connected component of a very big Graph (80 million edges) but it doesn't work, When the edges number is close to 40 million it crashed.

int main(){
    using namespace boost;
    {
        int node1,node2;
        typedef adjacency_list <vecS, vecS, undirectedS> Graph;
        Graph G;
        std::ifstream infile("pairs.txt");
        std::string line;
        while (std::getline(infile,line))
        {
            std::istringstream iss(line);
            iss >> node1 >> node2;
            add_edge(node1, node2, G);}
            cout <<"writing file"<<endl;
            int  j = 0;
            ofstream out;
            out.open("connected_component.txt");
            std::vector<int> component(num_vertices(G));
            int num = connected_components(G, &component[0]);
            std::vector<int>::size_type i;
            for (i = 0; i != component.size(); ++i){
                out << i << "\t "<<component[i] <<endl;}
                out.close();
            }

Any idea of how can I do this with boost ? or change my Graph data type ?

解决方案

Using randomized graph data, I can run 40 million edges in about 37s (peaking at 4.4GiB of memory according to Massif).

/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test

Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms

real    0m37.339s
user    0m36.078s
sys 0m1.202s

1. Memory allocations

Note however, that I tweaked it by using a vector for the edge list, so that I can reserve the capacity required:

typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, 
         no_property, vecS> Graph;

This

  • enhanced load performance by removing reallocations
  • reduces heap fragmentation

2. Vertex id scaling

An important note, too, is that the storage requirements scale with the number of vertices. More specifically, they scale with the domain of vertices. E.g. loading a file like this:

1 7
2 7
5 6
4 9

will have vastly less memory requirements than

1 70000
2 70000
5 60000
4 90000

In fact, rerunning the above benchmark, with exactly the same input, but only the first line altered from

 47662 60203

into

 476624766 602036020

Results in the following output:

Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @  0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms

real    2m20.318s
user    1m30.224s
sys 0m49.821s

As you can see google's malloc implementation (from gperftools) even warns about exceptionally large allocations and indeed, it runs much slower. (Oh, and the memory usage becomes such that Massif doesn't make it anymore, but I've seen it hit 23GiB in htop).

Full Code

See it Live On Coliru running on 4000 edges:

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>

#include <chrono>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);

    std::vector<std::pair<int, int> > as_read;

    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);

    std::cout << "Reading " << as_read.size() << " done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // build graph
    G.m_edges.reserve(as_read.size());
    for(auto& pair : as_read)
        add_edge(pair.first,pair.second,G);

    std::cout << "Building graph done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    std::cout << "Algorithm done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output
    std::cout <<"writing file"<<std::endl;

    std::ofstream out;
    out.open("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i) {
        out << i << "\t "<< component[i] << std::endl; 
    }

    out.close();
    std::cout << "Writing done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}

这篇关于使用Boost查找大图的连接组件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆