Repetitive process to follow links in a website (BeautifulSoup)

Problem description

I'm writing Python code that uses Beautiful Soup to get all the 'a' tags at a URL, takes the link at position 3, and then follows that link. I need to repeat this process about 18 times. The code below repeats the process twice. I can't come up with a way to repeat the same process 18 times in a loop. Any help would be appreciated.

import re
import urllib

from BeautifulSoup import *

# First pass: collect every href on the starting page
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html').read()
soup = BeautifulSoup(htm1)
tags = soup('a')
list1 = list()
for tag in tags:
    x = tag.get('href', None)
    list1.append(x)

M = list1[2]  # link at position 3

# Second pass: the same steps on the page that link points to
htm2 = urllib.urlopen(M).read()
soup = BeautifulSoup(htm2)
tags1 = soup('a')
list2 = list()
for tag1 in tags1:
    x2 = tag1.get('href', None)
    list2.append(x2)

y = list2[2]
print y

OK, I just wrote this code. It runs, but I get the same 4 links in the results. It looks like something is wrong in the loop (note: I'm running the loop 4 times).

import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'

for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
    print y

Solution

"I can't come up with a way to repeat the same process 18 times in a loop."

To repeat something 18 times in Python, you could use a for _ in range(18) loop:

#!/usr/bin/env python2
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup # $ pip install beautifulsoup4

url = 'http://example.com'
for _ in range(18):  # repeat 18 times
    soup = BeautifulSoup(urlopen(url))
    a = soup.find_all('a', href=True)  # all <a href> links
    if len(a) < 3:  # no 3rd link
        break  # exit the loop
    url = urljoin(url, a[2]['href'])  # 3rd link, note: ignore <base href>
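
The loop in the question appears to print the same link every time because list1 is created once, outside the loop, so list1[2] always comes from the first page; its check len(x2) < 3 also measures the length of the last href string rather than the number of links. The following is a minimal Python 3 sketch of the same idea as the loop above, not a definitive implementation. It assumes Python 3 and, so that it runs without third-party packages, swaps BeautifulSoup for the standard library's html.parser; with bs4 installed, soup.find_all('a', href=True) would play the same role. The names LinkCollector, fetch, follow_links and the get_page parameter are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a href="..."> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def fetch(url):
    """Download a page and return its text (assumption: UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(url, repeats, position=2, get_page=fetch):
    """Follow the link at index `position` up to `repeats` times.

    Returns the URLs reached after the starting one, in order.
    """
    visited = []
    for _ in range(repeats):
        collector = LinkCollector()  # fresh link list on every pass
        collector.feed(get_page(url))
        if len(collector.hrefs) <= position:  # no link at that position
            break
        url = urljoin(url, collector.hrefs[position])  # resolve relative hrefs
        visited.append(url)
    return visited
```

Calling follow_links('http://example.com', 18) would follow the 3rd link up to 18 times and return the URLs reached along the way. Passing a different get_page callable (for example, one that reads from a dict of saved pages) lets the loop be tried without network access.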
