Good graph traversal algorithm
Question
Abstract problem : I have a graph of about 250,000 nodes, and the average connectivity is around 10. Finding a node's connections is a long process (let's say 10 seconds). Saving a node to the database also takes about 10 seconds. I can check very quickly whether a node is already present in the db. Allowing concurrency, but with no more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest?
Concrete problem : I'm trying to scrape a website's user pages. To discover new users, I fetch the friend lists of already-known users. I've already imported about 10% of the graph, but I keep getting stuck in cycles or using too much memory remembering too many nodes.
My current implementation:
import datetime
import random
import sys
import time

def run() :
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool) :
    id = user
    alias = models.Alias.get(id)
    # if it's been updated in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now() :
        sys.stderr.write("Skipping: %s\n" % user)
    else :
        sys.stderr.write("Importing: %s\n" % user)
        while import_pool.num_jobs() > 20 :
            print "Too many queued jobs, sleeping"
            time.sleep(15)
        import_pool.add_job(alias_view.import_id, [id],
                            lambda rv : sys.stderr.write("Done Importing %s\n" % user))
    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2 :
        for user in random.sample(users, 2) :
            if user_pool.num_jobs() < 100 :
                user_pool.add_job(do_user, [user, import_pool, user_pool])

def crawl(id, limit=50) :
    '''returns the first 'limit' friends of a user'''
    *not relevant*
Problems of current implementation :
- Gets stuck in cliques that I have already imported, wasting time while the importing threads sit idle.
- This will only happen more and more as new nodes keep getting pointed out.
So, marginal improvements are welcome, as well as full rewrites. Thanks!
Answer
To remember the IDs of the users you've already visited, you need a map of 250,000 integers. That's far from "too much". Just maintain such a map, traverse only through edges that lead to as-yet-undiscovered users, and add each user to the map at the moment you find such an edge.
As far as I can see, you're close to implementing breadth-first search (BFS). Check Google for the details of this algorithm. And, of course, do not forget about mutexes -- you'll need them.
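The answer above can be sketched as a minimal BFS with a visited set. This is an illustrative sketch, not the asker's code: `fetch_friends` is a hypothetical stand-in for the site's slow friend-list request, and the lock stands in for the mutexes the answer mentions if several workers share the structures.

```python
from collections import deque
from threading import Lock

def bfs_crawl(seed, fetch_friends, max_nodes=250000):
    """Breadth-first traversal that never revisits a user.

    fetch_friends(user) -> list of neighbour ids (hypothetical stand-in
    for the slow friend-list request).
    """
    visited = {seed}   # ~250k ints is a small in-memory set
    queue = deque([seed])
    lock = Lock()      # guards visited/queue if workers run concurrently
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        for friend in fetch_friends(user):
            with lock:
                # mark at discovery time, so a user is queued at most once
                if friend not in visited and len(visited) < max_nodes:
                    visited.add(friend)
                    queue.append(friend)
    return order
```

On a toy graph with a cycle, the traversal visits each user exactly once:

```python
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
print(bfs_crawl("a", lambda u: graph.get(u, [])))  # ['a', 'b', 'c', 'd']
```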