Use python to crawl a website


Problem description


So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: how do I do this more dynamically than using nested while statements to search for links? I want to get all the links from this site, but I don't want to keep writing nested while loops.

    # First level: links found on the base page
    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)

    length = len(listOfLinks)
    count = 0

    # Second level: links found on each first-level page
    while count < length:

        twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
        twoListOfLinks = list(twoLevelLinks)
        twoCount = 0
        twoLength = len(twoListOfLinks)

        for twoLinks in twoListOfLinks:
            listOfLinks.append(twoLinks)

        count = count + 1

        # Third level: links found on each second-level page
        while twoCount < twoLength:
            threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
            threeListOfLinks = list(threeLevelLinks)

            for threeLinks in threeListOfLinks:
                listOfLinks.append(threeLinks)

            twoCount = twoCount + 1

    print('-' * 90)
    # remove all duplicates
    finalList = list(set(listOfLinks))
    print(finalList)


My second question: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so), and I know some of my processes and logic might be naive, but I have to learn somehow. Mainly, I just want to do this more dynamically than with nested while loops. Thanks in advance for any insight.

Recommended answer


The problem of spidering over a web site and getting all the links is a common problem. If you Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:

http://pypi.python.org/pypi/spider.py/0.5


Even better, Google found this question already asked and answered here on StackOverflow:

Anyone know of a good Python based web crawler that I could use?
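To address the first question directly: the fixed-depth nested while loops can be replaced by a single breadth-first traversal with a queue and a visited set. This is a minimal sketch, not a library; `get_links` stands in for any page-fetching helper (such as the asker's `getAllUniqueLinks`), which is passed in as a function so the traversal logic stays independent of BeautifulSoup:

```python
from collections import deque

def crawl(get_links, start_url, max_depth=3):
    """Breadth-first crawl: visit each URL once, up to max_depth levels deep.

    get_links is any callable that takes a URL and returns the links
    found on that page. Returns the set of all URLs discovered.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs still to expand
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

The `seen` set also speaks to the second question: because already-visited URLs are never re-queued, the crawl terminates exactly when no new links remain, so for a site whose pages are all reachable from the start page you know every discoverable link has been collected. Raising `max_depth` adds deeper levels without any new loops.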
