Efficient Alternative to "in"


Question

I'm writing a web crawler whose ultimate goal is to build a map of the path the crawler has taken. I have no idea at what rate other, and most definitely better, crawlers pull down pages, but mine clocks about 2,000 pages per minute.

The crawler uses a recursive backtracking algorithm, which I have limited to a depth of 15. To prevent it from endlessly revisiting pages, it stores the URL of each page it has visited in a list and checks each candidate URL against that list.

for href in tempUrl:
    ...
    if href not in urls:
        collect(href, parent, depth + 1)

This method seems to become a problem by the time the crawler has pulled down around 300,000 pages. At that point it averages about 500 pages per minute.

So my question is: what is another way to achieve the same functionality more efficiently?

I thought that decreasing the size of each entry might help, so instead of appending the entire URL, I append the first two and the last two characters of each URL as a string. This, however, hasn't helped.

Is there a way I could do this with sets or something?

Thanks for your help.

edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.

Recommended answer

Perhaps you could use a set instead of a list for the URLs that you have seen so far.
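To illustrate why this should help: `href not in urls` on a list compares the candidate against every stored entry, so each check gets slower as the list grows, and shortening the entries does not change how many comparisons are made (truncated URLs can also collide, so distinct pages may be skipped). A set is hash-based, so the same `in` test takes roughly constant time on average. Below is a minimal sketch of how the crawl loop might look with a set. The names `crawl`, `get_links`, `seen`, and `edges` are hypothetical, and the body of `collect` is an assumption, since the question only shows the call `collect(href, parent, depth+1)`.

```python
def crawl(start, get_links, max_depth=15):
    """Depth-limited recursive crawl that visits each URL at most once.

    get_links is a hypothetical callable: it takes a URL and returns the
    links found on that page (fetching and parsing are left out here).
    """
    seen = set()   # hash-based, so `in` is O(1) on average
    edges = []     # (parent, child) pairs: the map of the path taken

    def collect(url, parent, depth):
        seen.add(url)                  # store the full URL, not a fragment
        edges.append((parent, url))
        if depth >= max_depth:
            return
        for href in get_links(url):
            if href not in seen:       # same test as before, now on a set
                collect(href, url, depth + 1)

    collect(start, None, 0)
    return seen, edges


# Tiny in-memory "web" standing in for real pages (made up for illustration).
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
seen, edges = crawl("a", lambda url: graph.get(url, []))
```

The only change the original loop strictly needs is to create the collection with `urls = set()` and to replace `urls.append(...)` with `urls.add(...)`; the `if href not in urls` line stays as it is.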
