How to loop through a tags and redirect to retrieve more a tags?
For educational purposes, I am trying to write a program that prompts the user for a "url", a "count" and a "position". The "url" is scraped and the "a tags" on that page are retrieved, which yields a list of "a tags". The "position" is then used to select a link from that list, and that link becomes the new "url" to be scraped. "Count" is the number of times this process is repeated.
Code:
    import urllib
    from bs4 import BeautifulSoup as bfs

    # Declare global variables
    href_list = []
    no_iterations = 0

    # Prompt user for input
    url = raw_input('Enter url - ')
    count = raw_input('Enter count - ')
    position = raw_input('Enter position - ')

    # While loop with condition
    while no_iterations != int(count):
        no_iterations += 1
        # Scraping the url
        html = urllib.urlopen(url).read()
        soup = bfs(html)
        # Retrieve all of the anchor tags
        tags = soup('a')
        for tag in tags:
            href_list.append(tag.get('href', None))
        # Assigning new url
        url = href_list[int(position)-1]
        # Printing info for user
        print 'Retrieving:', href_list[int(position)-1]
    print 'Last Url:', href_list[int(position)-1]
When I run the program here is what I get:
Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
Enter count - 4
Enter position - 3
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
From the output, I can see that the URL is not being reset as it should be. Any advice is appreciated.
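A likely explanation for the repeated URL is that href_list is created once and only ever appended to, so the index position-1 keeps pointing at a link from the first page. The following is a minimal Python 3 sketch of that effect with no network access. The pages dictionary and both function names are hypothetical stand-ins for the scraped pages, not part of the original program.

```python
# Hypothetical stand-in for scraped pages: page name -> links found on that page.
pages = {
    "A": ["A1", "A2", "B"],
    "B": ["B1", "B2", "C"],
    "C": ["C1", "C2", "D"],
}


def follow_shared_list(start, count, position):
    """Mimics the question's loop: one list that keeps growing."""
    href_list = []  # created once, never cleared
    url = start
    visited = []
    for _ in range(count):
        href_list.extend(pages[url])   # new links land at the END of the list
        url = href_list[position - 1]  # so this index still hits the first page's link
        visited.append(url)
    return visited


def follow_fresh_list(start, count, position):
    """Same loop, but the list is rebuilt for every page."""
    url = start
    visited = []
    for _ in range(count):
        href_list = list(pages[url])   # only the current page's links
        url = href_list[position - 1]
        visited.append(url)
    return visited


print(follow_shared_list("A", 3, 3))
print(follow_fresh_list("A", 3, 3))
```

With the shared list, the walk never moves past the first page's third link. With a fresh list per page, it advances one page per iteration.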
I solved it by resetting the list where I stored the retrieved a tags.
Code:
    import urllib
    from bs4 import BeautifulSoup as bfs

    # Declare global variables
    href_list = []
    no_iterations = 0

    # Prompt user for input
    url = raw_input('Enter url - ')
    count = raw_input('Enter count - ')
    position = raw_input('Enter position - ')

    # While loop with condition
    while no_iterations != int(count):
        no_iterations += 1
        # Scraping the url
        html = urllib.urlopen(url).read()
        soup = bfs(html)
        # Retrieve all of the anchor tags
        tags = soup('a')
        for tag in tags:
            href_list.append(tag.get('href', None))
        # Assigning new url
        url = href_list[int(position)-1]
        # Reset the list so the next page starts from an empty list
        href_list = []
        # Printing info for user (use url here: href_list is empty after the reset)
        print 'Retrieving:', url
    print 'Last Url:', url
So the new output now is:
Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
Enter count - 4
Enter position - 3
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Mhairade.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Butchi.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html
Thanks for your support
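For readers on Python 3, where raw_input, the print statement and urllib.urlopen are no longer available, the same idea can be sketched roughly as follows. This is an illustration under assumptions, not the original code: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, assumes pages decode as UTF-8, and the names AnchorCollector, fetch_html and follow_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # None is stored when the tag has no href, like tag.get('href', None)
            self.hrefs.append(dict(attrs).get("href"))


def fetch_html(url):
    """Download a page and decode it as text (assumption: UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(url, count, position, fetch=fetch_html):
    """Follow the link at `position` (1-based) `count` times; return visited URLs."""
    visited = []
    for _ in range(count):
        collector = AnchorCollector()  # fresh link list for every page
        collector.feed(fetch(url))
        url = collector.hrefs[position - 1]
        visited.append(url)
    return visited


if __name__ == "__main__":
    start = input("Enter url - ")
    count = int(input("Enter count - "))
    position = int(input("Enter position - "))
    urls = follow_links(start, count, position)
    for u in urls:
        print("Retrieving:", u)
    if urls:
        print("Last Url:", urls[-1])
```

Creating a new collector inside the loop plays the same role as href_list = [] in the answer above. The fetch parameter is there only so the loop can be tried against canned HTML instead of a live site.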