Good graph traversal algorithm


Problem description

Abstract problem: I have a graph of about 250,000 nodes, and the average connectivity is around 10. Finding a node's connections is a long process (10 seconds, let's say). Saving a node to the database also takes about 10 seconds. I can check very quickly whether a node is already present in the DB. Allowing concurrency, but never having more than 10 long requests in flight at a time, how would you traverse the graph to gain the highest coverage the quickest?

Concrete problem: I'm trying to scrape a website's user pages. To discover new users, I fetch the friend lists of already-known users. I've already imported about 10% of the graph, but I keep getting stuck in cycles or using too much memory remembering too many nodes.

My current implementation:

import datetime
import random
import sys
import time

def run():
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool):
    id = user
    alias = models.Alias.get(id)

    # if it's been updated in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now():
        sys.stderr.write("Skipping: %s\n" % user)
    else:
        sys.stderr.write("Importing: %s\n" % user)
        while import_pool.num_jobs() > 20:
            sys.stderr.write("Too many queued jobs, sleeping\n")
            time.sleep(15)

        import_pool.add_job(alias_view.import_id, [id], lambda rv: sys.stderr.write("Done Importing %s\n" % user))

    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2:
        for user in random.sample(users, 2):
            if user_pool.num_jobs() < 100:
                user_pool.add_job(do_user, [user, import_pool, user_pool])

def crawl(id, limit=50):
    '''returns the first 'limit' friends of a user'''
    *not relevant*

Problems of the current implementation:

  • Gets stuck in cliques that I have already imported, wasting time while the importing threads sit idle.
  • Will add more as they get pointed out.

So, marginal improvements are welcome, as well as full rewrites. Thanks!

Answer

To remember the IDs of the users you've already visited, you need a map of 250,000 integers. That's far from "too much". Just maintain such a map, traverse only through edges that lead to as-yet-undiscovered users, and add each new user to the map at the point of finding such an edge.

As far as I can see, you're close to implementing breadth-first search (BFS). Check Google for the details of this algorithm. And, of course, don't forget about mutexes -- you'll need them.
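A rough sketch of that suggestion, decoupled from the question's site-specific code: `bfs_crawl`, `fetch_friends`, and `save_user` are hypothetical names standing in for the slow fetch and save operations described above. The visited set is checked under a lock before enqueueing, so each user is fetched and saved at most once, and at most 10 slow requests run at a time.

```python
import threading
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def bfs_crawl(seed, fetch_friends, save_user, max_workers=10):
    """Breadth-first traversal that visits each user once, with at
    most max_workers slow requests in flight at any moment."""
    visited = {seed}         # every user ever enqueued; breaks cycles
    queue = deque([seed])    # the BFS frontier
    lock = threading.Lock()  # guards visited and queue

    def process(user):
        save_user(user)                     # slow (~10 s) DB write
        for friend in fetch_friends(user):  # slow (~10 s) fetch
            with lock:
                if friend not in visited:   # only yet-undiscovered users
                    visited.add(friend)
                    queue.append(friend)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = set()
        while True:
            with lock:
                while queue and len(pending) < max_workers:
                    pending.add(pool.submit(process, queue.popleft()))
            if not pending:
                break  # frontier empty and no jobs left that could refill it
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any worker exception
    return visited
```

Memory stays modest because `visited` holds only user IDs, matching the answer's point that a 250,000-entry map is far from "too much".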

