Use Python to crawl a website
Question
I am looking for a dynamic way to crawl a website and grab the links from each page. I decided to experiment with BeautifulSoup. Two questions. First, how do I do this more dynamically than with nested while statements that search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)
length = len(listOfLinks)
count = 0
while(count < length):
    twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
    twoListOfLinks = list(twoLevelLinks)
    twoCount = 0
    twoLength = len(twoListOfLinks)
    for twoLinks in twoListOfLinks:
        listOfLinks.append(twoLinks)
    count = count + 1
    while(twoCount < twoLength):
        threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
        threeListOfLinks = list(threeLevelLinks)
        for threeLinks in threeListOfLinks:
            listOfLinks.append(threeLinks)
        twoCount = twoCount +1
print '--------------------------------------------------------------------------------------'
#remove all duplicates
finalList = list(set(listOfLinks))
print finalList
My second question: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so), and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly, I just want to do this more dynamically than with nested while loops. Thanks in advance for any insight.
Recommended answer
Spidering a website to collect all of its links is a common problem. If you search Google for "spider web site python", you can find libraries that will do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
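If you would rather see how the nested while loops can be avoided by hand, the usual approach is a breadth-first traversal: keep one queue of pages still to fetch and one set of URLs already seen, and loop until the queue is empty. The following is only a minimal sketch, not the code from the question or from the library above. It assumes Python 3 (the question uses Python 2), swaps BeautifulSoup for the standard library's html.parser to stay dependency-free, and all names (LinkParser, extract_links, fetch_html, crawl, max_pages) are made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, page_url):
    """Return the set of absolute, fragment-free URLs linked from html."""
    parser = LinkParser()
    parser.feed(html)
    return {urldefrag(urljoin(page_url, href)).url for href in parser.hrefs}


def fetch_html(url):
    """Illustrative fetcher (an assumption): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=1000):
    """Breadth-first crawl of same-host pages reachable from start_url."""
    host = urlparse(start_url).netloc
    seen = {start_url}          # every URL that has ever been queued
    queue = deque([start_url])  # URLs still waiting to be fetched
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that cannot be downloaded
        for link in extract_links(html, url):
            # Stay on the starting host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Calling crawl with the site's base URL returns the set of links found, at any depth, without another loop per level. On the second question: if the queue empties before max_pages is reached, the set holds every same-host URL reachable by following <a href> links from the start page. Pages that nothing links to, or that are only linked through JavaScript, cannot be discovered this way.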
Even better, Google found this question already asked and answered here on StackOverflow: