How to loop through a tags and redirect to retrieve more a tags?


Problem description


For educational purposes, I am trying to write a program that prompts the user for a "url", a "count" and a "position". The "url" is scraped and the "a tags" in the page are retrieved, which yields a list of "a tags". The "position" is then used to select a new link from that list, and that link becomes the new "url" to scrape. "Count" is the number of times this process is repeated.

Code:
import urllib
from bs4 import BeautifulSoup as bfs

# Declare global variables
href_list = []
no_iterations = 0

# Prompt user for input
url = raw_input('Enter url - ')
count = raw_input('Enter count - ')
position = raw_input('Enter position - ')

# While loop with condition
while no_iterations != int(count):
    no_iterations += 1

    # Scraping the url 
    html = urllib.urlopen(url).read()
    soup = bfs(html)

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href', None))

    # Assigning new url
    url = href_list[int(position)-1]

    # Printing info for user
    print 'Retrieving:', href_list[int(position)-1]
print 'Last Url:', href_list[int(position)-1]

When I run the program here is what I get:

Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html 
Enter count - 4
Enter position - 3 

Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html

Looking at the output, I can see that the URL is not being reset as it should be. Any advice is appreciated.
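
The repeated URL can be reproduced without any network access. The following is a minimal sketch in Python 3, assuming two hypothetical pages whose links are given as plain lists (`page_one` and `page_two` are made up for illustration). Because `href_list` is only ever appended to, index `position - 1` keeps pointing at a link from the first page:

```python
# Hypothetical link lists standing in for two scraped pages.
page_one = ['a.html', 'b.html', 'c.html']
page_two = ['d.html', 'e.html', 'f.html']

position = 3
href_list = []  # shared across iterations, as in the code above

picked = []
for links in (page_one, page_two):
    href_list.extend(links)  # links are appended; the list is never cleared
    picked.append(href_list[position - 1])

print(picked)  # ['c.html', 'c.html']
```

The second page's links land at indexes 3 to 5, so they are never selected.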

Solution

I solved it by resetting the list where I stored the retrieved a tags. Code:

import urllib
from bs4 import BeautifulSoup as bfs

# Declare global variables
href_list = []
no_iterations = 0

# Prompt user for input
url = raw_input('Enter url - ')
count = raw_input('Enter count - ')
position = raw_input('Enter position - ')

# While loop with condition
while no_iterations != int(count):
    no_iterations += 1

    # Scraping the url 
    html = urllib.urlopen(url).read()
    soup = bfs(html)

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href', None))

    # Assigning new url
    url = href_list[int(position)-1]
    # Reset the list so the next page's links start at index 0
    href_list = []
    # Printing info for user (href_list is empty now, so print url)
    print 'Retrieving:', url
print 'Last Url:', url

So the new output now is:

Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html 
Enter count - 4
Enter position - 3
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Mhairade.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Butchi.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html

Thanks for your support
