Repetitive process to follow links in a website (BeautifulSoup)
Problem description
I'm writing Python code that uses Beautiful Soup to get all the 'a' tags from a URL, takes the link at position 3, and follows that link. I need to repeat this process about 18 times. The code below repeats the process twice. I can't find a way to repeat the same process 18 times in a loop. Any help would be appreciated.

    import re
    import urllib
    from BeautifulSoup import *

    htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html ').read()
    soup = BeautifulSoup(htm1)
    tags = soup('a')
    list1 = list()
    for tag in tags:
        x = tag.get('href', None)
        list1.append(x)

    M = list1[2]

    htm2 = urllib.urlopen(M).read()
    soup = BeautifulSoup(htm2)
    tags1 = soup('a')
    list2 = list()
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list2.append(x2)

    y = list2[2]
    print y

OK, I just wrote this code. It runs, but I get the same 4 links in the results. It looks like something is wrong in the loop (note: I'm running the loop 4 times).

    import re
    import urllib
    from BeautifulSoup import *

    list1 = list()
    url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'

    for i in range(4):  # repeat 4 times
        htm2 = urllib.urlopen(url).read()
        soup1 = BeautifulSoup(htm2)
        tags1 = soup1('a')
        for tag1 in tags1:
            x2 = tag1.get('href', None)
            list1.append(x2)
        y = list1[2]
        if len(x2) < 3:  # no 3rd link
            break  # exit the loop
        else:
            url = y
        print y

Solution
"I can't come about a way to repeat the same process 18 times in a loop."
To repeat something 18 times in Python, you could use a for _ in range(18) loop:

    #!/usr/bin/env python2
    from urllib2 import urlopen
    from urlparse import urljoin
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    url = 'http://example.com'
    for _ in range(18):  # repeat 18 times
        soup = BeautifulSoup(urlopen(url))
        a = soup.find_all('a', href=True)  # all <a href> links
        if len(a) < 3:  # no 3rd link
            break  # exit the loop
        url = urljoin(url, a[2]['href'])  # 3rd link, note: ignore <base href>

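The answer's script targets Python 2 (urllib2, urlparse). As a rough sketch only, the same loop might look like this on Python 3. Two things here are assumptions rather than part of the answer: the standard library's html.parser is swapped in for BeautifulSoup so the example runs without third-party packages, and the names LinkCollector, fetch, follow_links and the get_html parameter are made up for illustration. It also shows why the second attempt in the question keeps printing the same link: there, list1 is created once before the loop and keeps growing, so list1[2] always refers to the third link of the first page. Building a fresh link list on every pass avoids that. Like the answer, this sketch ignores <base href>.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a href=...> tag, in order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:  # skip <a> tags without a usable href
                self.hrefs.append(href)


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(start_url, repeat=18, position=3, get_html=fetch):
    """Follow the link at `position` (1-based) on each page, up to `repeat` times.

    Returns the URLs that were followed, in order. `get_html` is a hook so the
    loop can be tried without network access.
    """
    url = start_url
    followed = []
    for _ in range(repeat):
        parser = LinkCollector()  # fresh collector: no links carried over
        parser.feed(get_html(url))
        if len(parser.hrefs) < position:  # page has too few links
            break
        url = urljoin(url, parser.hrefs[position - 1])  # resolve relative links
        followed.append(url)
    return followed
```

Calling follow_links with the start URL from the question and repeat=18 would return the list of pages visited; the last entry is the page reached after the final hop.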