How to get all links from website using Beautiful Soup (python) Recursively


Question

I want to be able to recursively get all links from a website, then follow those links and get all links from those sites. The depth should be 5-10, so that it returns an array of all links that it finds. Preferably using Beautiful Soup/Python. Thanks!

I have tried this so far and it is not working... any help will be appreciated.

from BeautifulSoup import BeautifulSoup
import urllib2

def getLinks(url):
    if (len(url)==0):
        return [url]
    else:
        files = [ ]
        page=urllib2.urlopen(url)
        soup=BeautifulSoup(page.read())
        universities=soup.findAll('a',{'class':'institution'})
        for eachuniversity in universities:
           files+=getLinks(eachuniversity['href'])
        return files

print getLinks("http://www.utexas.edu/world/univ/alpha/")

Answer

Recursive algorithms are used to reduce big problems to smaller ones that have the same structure, and then combine the results. They are often composed of a base case, which doesn't lead to recursion, and another case that does. For example, say you were born in 1986 and you want to calculate your age. You could write:

def myAge(currentyear):
    if currentyear == 1986: #Base case, does not lead to recursion.
        return 0
    else:                   #Leads to recursion
        return 1+myAge(currentyear-1)
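To see the two cases in action, the same function can be checked against the closed-form answer `currentyear - 1986` (a quick sanity check; the birth year 1986 is taken from the text above):

```python
def myAge(currentyear):
    if currentyear == 1986:  # base case, does not lead to recursion
        return 0
    else:                    # recursive case: peel off one year per call
        return 1 + myAge(currentyear - 1)

# Five calls unwind back to the base case: 1991 -> 1990 -> ... -> 1986
print(myAge(1991))  # -> 5
```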

I, myself, don't really see the point in using recursion for your problem. My suggestion is, first, that you put a limit in your code. What you gave us will just run indefinitely, because the program gets stuck in infinitely nested for loops; it never reaches an end and starts returning. So you can have a variable outside the function that updates every time you go down a level, and at a certain point stops the function from starting a new for loop and starts returning what it has found.

But then you are getting into changing global variables, you are using recursion in a strange way, and the code gets messy.

Now, reading the comments and seeing what you really want (which, I must say, is not really clear), you can use a recursive algorithm in your code, but you don't have to write all of it recursively.

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first one
        if newlink is None:       # no <a> tag on this page
            return url
        else:
            return url, recursiveUrl(newlink['href'], depth + 1)


def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a', {'class': 'institution'})
    found = []
    for link in links:
        # pass the href string, not the tag, and collect results in a
        # separate list (appending to `links` while iterating it never ends)
        found.append(recursiveUrl(link['href'], 0))
    return found

Now there is still a problem with this: links do not always point to web pages, but also to files and images. That's why I wrote the if/else statement in the recursive part of the 'url-opening' function. The other problem is that your first website has 2166 institution links, and creating 2166*5 BeautifulSoups is not fast. The code above runs the recursive function 2166 times. That shouldn't be a problem in itself, but you are dealing with big HTML (or PHP, or whatever) files, so making 2166*5 soups takes a huge amount of time.
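One way to tame both issues (depth and repeated work) is a breadth-first crawl with a visited set, so each URL is souped at most once. The sketch below is a hypothetical Python 3 rewrite, not the answer's original code: it uses only the standard library's `html.parser` instead of BeautifulSoup, and the `fetch` callable is injected so the logic can be exercised without network access.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags as the page is fed in."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=5):
    """Breadth-first crawl up to max_depth levels, visiting each URL once.

    `fetch` is any callable mapping a URL to its HTML text, e.g.
    lambda u: urllib.request.urlopen(u).read().decode() for real use.
    Returns the set of URLs actually visited.
    """
    visited = set()
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue          # dedupe: never soup the same page twice
            visited.add(url)
            parser = LinkParser()
            try:
                parser.feed(fetch(url))
            except Exception:
                continue          # skip unreachable pages or non-HTML content
            next_frontier.extend(parser.links)
        frontier = next_frontier
    return visited
```

With the question's ~2166 institution links, the visited set keeps the total number of parses bounded by the number of distinct pages rather than the number of link occurrences.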

