Good graph traversal algorithm
Question
Abstract problem : I have a graph of about 250,000 nodes, and the average connectivity is around 10. Finding a node's connections is a long process (let's say 10 seconds). Saving a node to the database also takes about 10 seconds. I can check very quickly whether a node is already present in the db. Allowing concurrency, but with no more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest?
Concrete problem : I'm trying to scrape a website's user pages. To discover new users, I fetch the friend lists of already-known users. I've already imported about 10% of the graph, but I keep getting stuck in cycles or using too much memory remembering too many nodes.
My current implementation:
import datetime
import random
import sys
import time

def run() :
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool) :
    id = user
    alias = models.Alias.get(id)
    # if it's been updated in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now() :
        sys.stderr.write("Skipping: %s\n" % user)
    else :
        sys.stderr.write("Importing: %s\n" % user)
        while import_pool.num_jobs() > 20 :
            print "Too many queued jobs, sleeping"
            time.sleep(15)
        import_pool.add_job(alias_view.import_id, [id],
                            lambda rv : sys.stderr.write("Done Importing %s\n" % user))
    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2 :
        for user in random.sample(users, 2) :
            if user_pool.num_jobs() < 100 :
                user_pool.add_job(do_user, [user, import_pool, user_pool])

def crawl(id, limit=50) :
    '''returns the first 'limit' friends of a user'''
    *not relevant*
Problems of current implementation :
- Gets stuck in cliques that I have already imported, wasting time while the importing threads sit idle.
- This will only happen more and more as new nodes keep getting pointed out.
So, marginal improvements are welcome, as well as full rewrites. Thanks!
Answer
To remember the IDs of the users you've already visited, you need a map of 250,000 integers. That's far from "too much". Just maintain such a map, traverse only through edges that lead to as-yet-undiscovered users, and add each user to the map at the moment you find such an edge.
As far as I can see, you're close to implementing breadth-first search (BFS). Check Google for the details of this algorithm. And, of course, do not forget about mutexes -- you'll need them.
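The answer above can be sketched as a minimal BFS with a visited set. This is an illustrative sketch, not the asker's code: `fetch_friends` is a hypothetical stand-in for the site's slow friend-list request, and the lock stands in for the mutexes the answer mentions if several workers share the structures.

```python
from collections import deque
from threading import Lock

def bfs_crawl(seed, fetch_friends, max_nodes=250000):
    """Breadth-first traversal that never revisits a user.

    fetch_friends(user) -> list of neighbour ids (hypothetical stand-in
    for the slow friend-list request).
    """
    visited = {seed}   # ~250k ints is a small in-memory set
    queue = deque([seed])
    lock = Lock()      # guards visited/queue if workers run concurrently
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        for friend in fetch_friends(user):
            with lock:
                # mark at discovery time, so a user is queued at most once
                if friend not in visited and len(visited) < max_nodes:
                    visited.add(friend)
                    queue.append(friend)
    return order
```

On a toy graph with a cycle, the traversal visits each user exactly once:

```python
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
print(bfs_crawl("a", lambda u: graph.get(u, [])))  # ['a', 'b', 'c', 'd']
```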